Here, we will cover how to run linear models (LMs; e.g., Gaussian/normal distribution), generalized linear models (GLMs; e.g., binomial, or Poisson for count data), and mixed models (LMMs/GLMMs; e.g., fixed vs. random effects). We will also expand on models for Poisson-distributed data by evaluating how to handle common issues: over- or under-dispersed counts (Poisson vs. negative binomial vs. Conway-Maxwell-Poisson), counts that should be represented as rates (e.g., counts per unit time using Poisson offsets), and zero-inflated counts (zero-inflated regression vs. hurdle models). Finally, we will finish the session with appropriate ways to run model selection for finding a winning model from a set of candidates, using the Akaike Information Criterion (AIC) or likelihood ratio tests (LRTs) for nested models. This is not a statistics course, so you will need to be familiar with most of these models already.
Before running the following code, please open the R Project for the EntSoc R Webinar series in the main folder (“EntSoc_R-Webinar_2020.Rproj”).
Here is a list of packages required for this R course. You will need to install these prior to the class, either from the “Packages” panel in RStudio or using the function below.
install.packages("")
Once installed, we can load these packages in advance. I will point out whenever we are running a function from one of these packages throughout this session.
library(here) # for navigation among folders
library(tidyverse) # for all tidyverse packages
library(pscl) # for zero-inflated regression models
library(glmmTMB) # for both zero-inflated and hurdle models
library(lme4) # for mixed models
library(boot) # testing assumptions of the Gamma distribution
library(car) # for likelihood ratio tests/marginal hypothesis testing
library(lmtest) # for likelihood ratio tests
library(bbmle) # for AIC
Today, we will cover six main topics in this workshop:
We will cover alternative statistical distributions to the Normal and Poisson distributions, when encountering common issues with these data.
For background reading that is required for this course, I would highly recommend reading:
I have listed some other useful resources at the end of this R Markdown file. Not every way of describing something will make sense to you, so it often takes finding the right teacher or resource for these concepts to click.
First, we will cover running linear models (LMs) for data that follow a normal (or Gaussian) distribution. Linear models can be used to carry out:
… that we will cover today. For these models, we will use Brown hare (Lepus europaeus) data collected over 17 years (1992-2008) at 56 sites in 8 regions of Switzerland for many of our examples today. These sites vary in area and elevation, and belong to two habitat types (arable and grassland). Mean density is the count1 of hares divided by the area of the site (i.e., mean.density = count1/area). These data are used in the 2010 Marc Kery book, which contains examples of both R and WinBUGS code.
hares <- read.table(here::here("Session 2", "Data", "hares.txt"), header = T)
head(hares)
## no site region area elevation landuse year count1 count2 mean.density
## 1 1 AG01 CH.Central 2.23 384 arable 1992 NA NA NA
## 2 2 AG01 CH.Central 2.23 384 arable 1993 NA NA NA
## 3 3 AG01 CH.Central 2.23 384 arable 1994 NA NA NA
## 4 4 AG01 CH.Central 2.23 384 arable 1995 6 4 2.690583
## 5 5 AG01 CH.Central 2.23 384 arable 1996 7 5 3.139013
## 6 6 AG01 CH.Central 2.23 384 arable 1997 3 3 1.345291
First, we will run a single stratum analysis of variance (aka “intercept-only”) model to estimate the mean density of Brown hares across Switzerland.
dens1 <- lm(mean.density ~ 1, data = hares)
summary(dens1)
##
## Call:
## lm(formula = mean.density ~ 1, data = hares)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.6137 -2.7488 -0.9541 1.5672 17.3565
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.7366 0.1467 32.29 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.817 on 676 degrees of freedom
## (275 observations deleted due to missingness)
Interpreting the summary output:
mean(dens1$residuals) # mean of the residuals is close to zero
## [1] 4.079368e-16
mean(hares$mean.density, na.rm = T) # Mean
## [1] 4.736569
sd(hares$mean.density, na.rm = T)/sqrt(nrow(subset(hares, !is.na(mean.density)))) # SE = sd/sqrt(n)
## [1] 0.1466952
summary(dens1)$sigma / summary(dens1)$coefficient[1] # relative error: residual SD is ~81% of the mean estimate
## [1] 0.8058354
Next, we will run a one-way analysis of variance (ANOVA) model to estimate mean density of brown hares in the two habitat types: arable vs. grassland.
dens2 <- lm(mean.density ~ landuse, data = hares)
summary(dens2)
##
## Call:
## lm(formula = mean.density ~ landuse, data = hares)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.1593 -2.6117 -0.9307 1.7159 16.7613
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.3317 0.1672 31.893 < 2e-16 ***
## landusegrass -2.1432 0.3172 -6.756 3.07e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.697 on 675 degrees of freedom
## (275 observations deleted due to missingness)
## Multiple R-squared: 0.06333, Adjusted R-squared: 0.06194
## F-statistic: 45.64 on 1 and 675 DF, p-value: 3.072e-11
The lm() summary also contains three more outputs:
summary(dens2)$r.squared # landuse explains only 6.3% of the variance in mean density
## [1] 0.06333232
TSS <- sum((hares$mean.density-mean(hares$mean.density, na.rm = T))^2, na.rm = T) # total sum squares
RSS <- sum(dens2$residuals^2) # residual sum squares (difference between y values and the two means)
(TSS-RSS)/TSS # r-squared
## [1] 0.06333232
For this model, we used what is referred to as “effects parameterization”. The intercept (\(\beta_{0}\)) is the mean for the reference (or “dummy”) level, arable land, and landusegrass (the slope, \(\beta_{1}\)) is the difference of grassland relative to that reference level.
coef(dens2) # Intercept (i.e., arable estimate) and slope (i.e., difference between arable and grassland)
## (Intercept) landusegrass
## 5.331719 -2.143173
The mean density in the arable land is 5.3317186 and the mean density in the grassland is 3.1885457, i.e., \(y = \beta_{0} + \beta_{1} x\) where x is either “1” for grassland or “0” for arable.
\[ y = \beta_{0} + \beta_{1} x \]
subset(hares, !is.na(mean.density))[70:80,]
## no site region area elevation landuse year count1 count2 mean.density
## 105 105 BE03 Aare 1.12 557 grass 1994 1 NA 0.8928571
## 106 106 BE03 Aare 1.12 557 grass 1995 1 1 0.8928571
## 109 109 BE03 Aare 1.12 557 grass 1998 1 1 0.8928571
## 111 111 BE03 Aare 1.12 557 grass 2000 2 1 1.7857143
## 120 120 BE04 Aare 2.77 532 arable 1992 5 7 2.5270758
## 121 121 BE04 Aare 2.77 532 arable 1993 8 10 3.6101083
## 122 122 BE04 Aare 2.77 532 arable 1994 1 6 2.1660650
## 123 123 BE04 Aare 2.77 532 arable 1995 9 8 3.2490975
## 126 126 BE04 Aare 2.77 532 arable 1998 6 8 2.8880866
## 128 128 BE04 Aare 2.77 532 arable 2000 21 8 7.5812274
## 133 133 BE04 Aare 2.77 532 arable 2005 1 2 0.7220217
model.matrix(dens2)[70:80,] # each row is an observation used to find MLE for landuse
## (Intercept) landusegrass
## 105 1 1
## 106 1 1
## 109 1 1
## 111 1 1
## 120 1 0
## 121 1 0
## 122 1 0
## 123 1 0
## 126 1 0
## 128 1 0
## 133 1 0
effects_par <- lm(mean.density ~ 1 + landuse, data = hares)
summary(effects_par)$coefficients # same model as dens2, formula assumes "~ 1 +"
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.331719 0.1671745 31.893138 7.712778e-137
## landusegrass -2.143173 0.3172381 -6.755723 3.071940e-11
However, we can also use the “means parameterization” approach for our model structure, where each group mean is estimated directly rather than relative to the “dummy” (reference) variable.
means_par <- lm(mean.density ~ -1 + landuse, data = hares) # removes dummy variable
summary(means_par)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## landusearable 5.331719 0.1671745 31.89314 7.712778e-137
## landusegrass 3.188546 0.2696159 11.82625 1.846049e-29
model.matrix(effects_par)[70:80,] # each row is an observation used to find MLE
## (Intercept) landusegrass
## 105 1 1
## 106 1 1
## 109 1 1
## 111 1 1
## 120 1 0
## 121 1 0
## 122 1 0
## 123 1 0
## 126 1 0
## 128 1 0
## 133 1 0
model.matrix(means_par)[70:80, ]
## landusearable landusegrass
## 105 0 1
## 106 0 1
## 109 0 1
## 111 0 1
## 120 1 0
## 121 1 0
## 122 1 0
## 123 1 0
## 126 1 0
## 128 1 0
## 133 1 0
Practice example 1: We will be using the built-in “ToothGrowth” R dataset. These data are from an experiment studying the effect of vitamin C on tooth growth in 60 guinea pigs. Each animal received one of three dose levels of vitamin C (0.5, 1, and 2 mg/day) by one of two delivery methods: orange juice (OJ) or ascorbic acid (VC).
?ToothGrowth
attach(ToothGrowth)
head(ToothGrowth)
## len supp dose
## 1 4.2 VC 0.5
## 2 11.5 VC 0.5
## 3 7.3 VC 0.5
## 4 5.8 VC 0.5
## 5 6.4 VC 0.5
## 6 10.0 VC 0.5
With these data, run an ANOVA to estimate tooth growth for the three dosages of vitamin C. Run two models using the effects and means parameterization approach.
str(ToothGrowth)
## 'data.frame': 60 obs. of 3 variables:
## $ len : num 4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
## $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
## $ dose: num 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
tg_effects <- lm(len ~ factor(dose), data = ToothGrowth)
summary(tg_effects)
##
## Call:
## lm(formula = len ~ factor(dose), data = ToothGrowth)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.6000 -3.2350 -0.6025 3.3250 10.8950
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.6050 0.9486 11.180 5.39e-16 ***
## factor(dose)1 9.1300 1.3415 6.806 6.70e-09 ***
## factor(dose)2 15.4950 1.3415 11.551 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.242 on 57 degrees of freedom
## Multiple R-squared: 0.7029, Adjusted R-squared: 0.6924
## F-statistic: 67.42 on 2 and 57 DF, p-value: 9.533e-16
tg_means <- lm(len ~ -1 + factor(dose), data = ToothGrowth)
coef(tg_means)
## factor(dose)0.5 factor(dose)1 factor(dose)2
## 10.605 19.735 26.100
Hint: you may want to use str() to look at the structure of the dataset.
Here, we will run a two-way ANOVA without an interaction, estimating the additive effects of landuse and region on the mean density of Brown hares.
dens4 <- lm(mean.density ~ region + landuse, data = hares)
coef(dens4)
## (Intercept) regionBaselland regionCH.Central regionCH.E
## 4.9529579 -1.4585568 -1.9953734 -0.5712674
## regionCH.N regionCH.SW regionCH.W regionRhone
## 0.3031590 4.6676222 0.1660742 -1.1703031
## landusegrass
## -0.8073997
dens5 <- lm(mean.density ~ -1 + region + landuse, data = hares)
coef(dens5) # mildly more comprehensible
## regionAare regionBaselland regionCH.Central regionCH.E
## 4.9529579 3.4944011 2.9575845 4.3816905
## regionCH.N regionCH.SW regionCH.W regionRhone
## 5.2561169 9.6205801 5.1190322 3.7826548
## landusegrass
## -0.8073997
By removing the dummy variable for region (“dens5”), we can see that each region has an estimate for the mean density of brown hares in arable land. To obtain a grassland estimate for each region, you still need to add the landusegrass term, coef(dens5)[9], to each region estimate.
coef(dens5)[1:8] # estimate for arable land for all regions
## regionAare regionBaselland regionCH.Central regionCH.E
## 4.952958 3.494401 2.957585 4.381691
## regionCH.N regionCH.SW regionCH.W regionRhone
## 5.256117 9.620580 5.119032 3.782655
coef(dens5)[1:8] + coef(dens5)[9] # estimate for grassland for all regions
## regionAare regionBaselland regionCH.Central regionCH.E
## 4.145558 2.687001 2.150185 3.574291
## regionCH.N regionCH.SW regionCH.W regionRhone
## 4.448717 8.813180 4.311633 2.975255
Next, we will run a two-way ANOVA with interactive effects of both vitamin C dosage and supplement type on tooth growth in guinea pigs.
grow1 <- lm(len ~ supp + factor(dose) + supp:factor(dose), data = ToothGrowth)
coef(grow1)
## (Intercept) suppVC factor(dose)1
## 13.23 -5.25 9.47
## factor(dose)2 suppVC:factor(dose)1 suppVC:factor(dose)2
## 12.83 -0.68 5.33
grow2 <- lm(len ~ supp * factor(dose), data = ToothGrowth) # same model as grow1
coef(grow2)
## (Intercept) suppVC factor(dose)1
## 13.23 -5.25 9.47
## factor(dose)2 suppVC:factor(dose)1 suppVC:factor(dose)2
## 12.83 -0.68 5.33
Practice example 2: For this practice example, we will be using data from Lampe et al. 2013. Grasshopper nymphs were relocated from seven roadside (n = 112) and five non-roadside (n = 65) habitats (i.e., “Origin”) to a laboratory. Half from each habitat were reared in either noisy or quiet conditions (i.e., “Treatment”) for a full two-by-two factorial design. Then, the adult males were weighed (i.e., “BodyMass”) and their courtship songs (i.e., “LocMax”) recorded.
ghp <- read.csv(here::here("Session 2", "Data", "GrasshopperSong.csv"))
head(ghp)
## Site Individual Origin Treatment LocMax BodyMass
## 1 Asch AS_lei_la_0001 non-roadside noisy 7193 0.084
## 2 Asch AS_lei_la_0001 non-roadside noisy 7239 0.084
## 3 Asch AS_lei_la_0001 non-roadside noisy 7358 0.084
## 4 Asch AS_Lei_la_0002 non-roadside noisy 8958 0.086
## 5 Asch AS_Lei_la_0002 non-roadside noisy 8452 0.086
## 6 Asch AS_Lei_la_0002 non-roadside noisy 8958 0.086
Using the grasshopper song data, run a two-way ANOVA looking at the effects of origin population (i.e., non-roadside or roadside) and rearing treatment (i.e., noisy vs quiet) on male Bow-winged grasshopper song.
gh1 <- lm(LocMax ~ Origin*Treatment, data = ghp)
summary(gh1)
##
## Call:
## lm(formula = LocMax ~ Origin * Treatment, data = ghp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1862.3 -542.5 -153.7 270.7 4421.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7168.84 93.47 76.698 < 2e-16 ***
## Originroadside 437.85 114.68 3.818 0.000151 ***
## Treatmentquiet -192.88 129.83 -1.486 0.137979
## Originroadside:Treatmentquiet -115.48 163.19 -0.708 0.479463
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 896.5 on 519 degrees of freedom
## Multiple R-squared: 0.06468, Adjusted R-squared: 0.05928
## F-statistic: 11.96 on 3 and 519 DF, p-value: 1.384e-07
gh2 <- lm(LocMax ~ -1 + Origin + Treatment + Origin:Treatment, data = ghp)
coef(gh2)
## Originnon-roadside Originroadside
## 7168.8370 7606.6868
## Treatmentquiet Originroadside:Treatmentquiet
## -192.8774 -115.4828
A simple linear regression is an approach for modelling the relationship between a single continuous predictor variable \(x_{i}\) on a continuous response variable \(y_{i}\), as follows:
\[ y_{i} = \beta_{0} + \beta_{1}x_{i} \] where the model describes a line with some intercept \(\beta_{0}\) and y-intercept $_{1}, such that the linear function provides a “best-fitting” line between the two variables.
Here, we will use the built-in dataset from an experiment evaluating the effects of diet on early growth of chicks to run a simple linear regression. Chicks were given one of four diets and weighed each day since birth (“Time”) for approximately 21 days.
attach(ChickWeight)
head(ChickWeight)
## Grouped Data: weight ~ Time | Chick
## weight Time Chick Diet
## 1 42 0 1 1
## 2 51 2 1 1
## 3 59 4 1 1
## 4 64 6 1 1
## 5 76 8 1 1
## 6 93 10 1 1
cw1 <- lm(weight ~ Time, data = ChickWeight)
summary(cw1)
##
## Call:
## lm(formula = weight ~ Time, data = ChickWeight)
##
## Residuals:
## Min 1Q Median 3Q Max
## -138.331 -14.536 0.926 13.533 160.669
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 27.4674 3.0365 9.046 <2e-16 ***
## Time 8.8030 0.2397 36.725 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 38.91 on 576 degrees of freedom
## Multiple R-squared: 0.7007, Adjusted R-squared: 0.7002
## F-statistic: 1349 on 1 and 576 DF, p-value: < 2.2e-16
pred.lin <- coef(cw1)[1] + coef(cw1)[2]*0:21
with(ChickWeight, plot(Time, weight, xlab = "Days since birth"))
points(0:21, pred.lin, type = "l", col = "blue")
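Hand-coding the coefficients works for simple models, but the predict() function scales better as models grow. A minimal sketch with the same simple regression (names pred/new.dat are illustrative):

```r
# Fit the same simple regression and generate the fitted line with predict()
cw1 <- lm(weight ~ Time, data = ChickWeight)
new.dat <- data.frame(Time = 0:21)
new.dat$pred <- predict(cw1, newdata = new.dat)
head(new.dat) # same values as coef(cw1)[1] + coef(cw1)[2]*0:21
```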
We can also run n-degree polynomial linear functions. I have chosen to run only a second-degree polynomial (i.e., quadratic function) to evaluate whether chick growth changes nonlinearly with time.
cw2 <- lm(weight ~ Time + I(Time*Time), data = ChickWeight)
summary(cw2)
##
## Call:
## lm(formula = weight ~ Time + I(Time * Time), data = ChickWeight)
##
## Residuals:
## Min 1Q Median 3Q Max
## -147.952 -12.507 0.518 11.126 151.048
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38.13394 4.08288 9.340 < 2e-16 ***
## Time 5.45963 0.89962 6.069 2.34e-09 ***
## I(Time * Time) 0.15684 0.04071 3.852 0.00013 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 38.45 on 575 degrees of freedom
## Multiple R-squared: 0.7083, Adjusted R-squared: 0.7073
## F-statistic: 698 on 2 and 575 DF, p-value: < 2.2e-16
pred.quad <- coef(cw2)[1] + coef(cw2)[2]*0:21 + coef(cw2)[3]*(0:21)^2
with(ChickWeight, plot(Time, weight, xlab = "Days since birth"))
points(0:21, pred.quad, type = "l", col = "red") # quadratic function
points(0:21, pred.lin, type = "l", col = "blue") # linear function
Note that we need to use the function I() for the quadratic term, which treats the variable “as is” rather than an interaction between two variables (as seen in the two-way ANOVA with an interaction).
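An equivalent way to specify the quadratic term, sketched here on the built-in ChickWeight data, is poly() with raw = TRUE (which reproduces the I() coding; the default orthogonal polynomials give the same fitted curve but different coefficient values):

```r
# poly(Time, 2, raw = TRUE) expands to Time + Time^2, matching I(Time*Time)
cw2b <- lm(weight ~ poly(Time, 2, raw = TRUE), data = ChickWeight)
unname(coef(cw2b)) # same estimates as cw2, only the coefficient names differ
```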
Practice example 3: Using the built-in “ToothGrowth” R dataset, run linear regression to determine whether tooth growth changes linearly with vitamin C dosage. Then, plot the raw data and model predicted values.
tg1 <- lm(len ~ dose, data = ToothGrowth)
summary(tg1)
##
## Call:
## lm(formula = len ~ dose, data = ToothGrowth)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.4496 -2.7406 -0.7452 2.8344 10.1139
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.4225 1.2601 5.89 2.06e-07 ***
## dose 9.7636 0.9525 10.25 1.23e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.601 on 58 degrees of freedom
## Multiple R-squared: 0.6443, Adjusted R-squared: 0.6382
## F-statistic: 105.1 on 1 and 58 DF, p-value: 1.233e-14
x.vals <- c(0.5, 1, 2)
with(ToothGrowth, plot(dose, len)) # plot(ToothGrowth$dose, ToothGrowth$len)
points(x.vals, coef(tg1)[1] + coef(tg1)[2]*x.vals, type = "l")
An analysis of covariance (ANCOVA) combines both categorical (e.g., ANOVA) and continuous (e.g., linear regression) covariates to predict some response variable.
Using the ChickWeight dataset, we will run an ANCOVA to evaluate the interactive effects of diet and time on chick weights.
cw3 <- lm(weight ~ Diet * Time, data = ChickWeight)
coef(cw3)
## (Intercept) Diet2 Diet3 Diet4 Time Diet2:Time
## 30.9309803 -2.2973848 -12.6806551 -0.1388608 6.8417972 1.7673391
## Diet3:Time Diet4:Time
## 4.5810738 2.8725684
chick.dat <- data.frame(expand.grid(Time = 0:21, Diet = factor(c(1:4))))
chick.dat$pred <- predict(cw3, newdata = chick.dat, type = "response") # use the predict() function
head(chick.dat)
## Time Diet pred
## 1 0 1 30.93098
## 2 1 1 37.77278
## 3 2 1 44.61457
## 4 3 1 51.45637
## 5 4 1 58.29817
## 6 5 1 65.13997
ggplot(ChickWeight, aes(x = Time, y = weight, col = Diet)) + geom_point() +
geom_line(chick.dat, mapping = aes(x = Time, y = pred, col = factor(Diet)))
Practice example 4: Using the built-in data on ToothGrowth, run an ANCOVA to evaluate the interactive effects of supplement type and dose on tooth growth. Then, plot these data.
tg2 <- lm(len ~ dose*supp, data = ToothGrowth)
summary(tg2)
##
## Call:
## lm(formula = len ~ dose * supp, data = ToothGrowth)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.2264 -2.8462 0.0504 2.2893 7.9386
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.550 1.581 7.304 1.09e-09 ***
## dose 7.811 1.195 6.534 2.03e-08 ***
## suppVC -8.255 2.236 -3.691 0.000507 ***
## dose:suppVC 3.904 1.691 2.309 0.024631 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.083 on 56 degrees of freedom
## Multiple R-squared: 0.7296, Adjusted R-squared: 0.7151
## F-statistic: 50.36 on 3 and 56 DF, p-value: 6.521e-16
pred.OJ <- coef(tg2)[1] + coef(tg2)[2]*x.vals + coef(tg2)[3]*0 + coef(tg2)[4]*x.vals*0
pred.VC <- coef(tg2)[1] + coef(tg2)[2]*x.vals + coef(tg2)[3]*1 + coef(tg2)[4]*x.vals*1
pred.dat <- data.frame(dose = rep(x.vals, times = 2), supp = rep(c("OJ", "VC"), each = 3),
pred = c(pred.OJ, pred.VC))
ggplot(ToothGrowth, aes(x = dose, y = len, col = supp)) + geom_point() +
geom_line(pred.dat, mapping = aes(x = dose, y = pred, col = supp))
Before continuing with this statistical distribution, we always want to test for the assumptions of normality on our response variable. A linear regression has four assumptions:
We can test for relevant assumptions using diagnostic plots:
hares$yearPost <- hares$year - 1991 # rescale year to converge on an estimate, scale(year) also works
dens6 <- lm(mean.density ~ yearPost, data = hares)
plot(dens6, 1) # Residuals vs fitted - red line is flat, meaning a linear relationship
plot(dens6, 2) # QQ plot - examine whether residuals are normally-distributed
plot(dens6, 3) # Scale-location - homogeneity of variance of residuals (homoscedasticity) if horizontal line with equal spread
plot(dens6, 4) # Cook's distance - identify extreme values and their obs. number
plot(dens6, 5) # Residuals vs leverage - for identifying influential cases or extreme values
hist(hares$mean.density) # right-skewed
It is difficult to do a general test for dependence of the residual error term. You will need to know why your residual error term may be dependent: either residuals can correlate with another variable (e.g., check residuals-fitted plot) or residuals can correlate with a nearby residual (e.g., autocorrelation in time series data).
LI04.grass <- subset(hares, site == "BE12" & !is.na(mean.density))
acf(LI04.grass$mean.density)
LU01.grass <- subset(hares, site == "GE02" & !is.na(mean.density))
acf(LU01.grass$mean.density)
Our residual error terms seem to be independent, particularly from temporal autocorrelation. Regardless, our data do not meet all the assumptions of a linear regression (i.e., normality of residuals, since the data are positively (right) skewed and bound between 0 and \(\infty\)).
First, we can use the most common approach when our data do not meet the assumptions of normality: transforming our response variable, y. There are many ways to transform your data (e.g., log, square-root, arcsine), but we will only cover the log transformation of the response variable, y (i.e., log(y) is normal given x):
\[ ln(y_{i}) = \beta_{0} + \beta_{1}x_{i} \]
dens7 <- lm(log(mean.density) ~ yearPost, data = hares)
plot(dens7, 1) # linear relationship between x and y (good)
plot(dens7, 2) # residuals are not perfectly normally distributed, mild right-skew (ok)
plot(dens7, 3) # mild constant variance of the residuals (ok)
hist(hares$mean.density) # positively/right skewed
hist(log(hares$mean.density)) # slightly positively/right skewed
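One caveat with the log transformation: predictions from a log(y) model are on the log scale, and simply exponentiating them recovers the median rather than the mean on the original scale. A simulated sketch (the data below are invented purely for illustration):

```r
# Simulate a lognormal response so the log-transformed model is exactly right
set.seed(42)
x <- runif(100)
y <- exp(0.5 + 1.2 * x + rnorm(100, sd = 0.3))
m.log <- lm(log(y) ~ x)
# exp() of the prediction gives the median on the original scale
exp(predict(m.log, newdata = data.frame(x = c(0, 1))))
# the mean on the original scale needs a bias correction: exp(pred + sigma^2/2)
exp(predict(m.log, newdata = data.frame(x = c(0, 1))) + summary(m.log)$sigma^2 / 2)
```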
Instead of transforming our response to fit a statistical distribution (i.e., log(y) is normal given x), we can choose a statistical distribution that fits our data. First, we might want to try the log-link Gaussian distribution (i.e., mean of log(y) responses linearly to x), such that:
\[ln(\mu) = \beta_{0} + \beta_{1}x \]
dens8 <- glm(mean.density ~ yearPost, family = gaussian(link = "log"), data = hares) # glm() instead of lm()
plot(dens8, 1) # relationship between x and y seems linear (good)
plot(dens8, 2) # residuals are not normally distributed, left-skewed (bad)
plot(dens8, 3) # constant variance of the residuals (ok)
Compared to log-transformation of the y values, the log-link Gaussian does not seem to perform better when checking the assumptions of a linear regression.
We could also use the Gamma distribution, which allows for non-negative, skewed, continuous data bound between 0 and \(\infty\). The Gamma distribution (with either a log or an inverse link) accommodates heavier-tailed/skewed data, particularly with the inverse link.
dens9 <- glm(mean.density ~ yearPost, family = Gamma(link = "log"), data = hares)
gamma.diag <- boot::glm.diag(dens9)
boot::glm.diag.plots(dens9, gamma.diag)
Calling plot(dens9) does not provide the correct diagnostic plots for the Gamma distribution; the “boot” package allows us to assess the relevant assumptions instead. With the Gamma distribution (or any distribution with a scale/dispersion term), the variance is modelled as a function of the mean, so we do not need to worry about heteroscedasticity.
Practice example 5: Test the assumptions of normality for the simple linear regression model that was run in code chunk #17. Then, re-run the model with a Gamma distribution and evaluate the assumptions.
cw1 <- lm(weight ~ Time, data = ChickWeight)
plot(cw1, 1) # yes
plot(cw1, 2) # no, too peaked in the middle
plot(cw1, 3) # no, heteroscedasticity
plot(cw1, 4) # no, outliers
hist(ChickWeight$weight)
cw1_lg <- glm(weight ~ Time, family = Gamma(link = "log"), data = ChickWeight)
cw1_lg_diag <- glm.diag(cw1_lg)
glm.diag.plots(cw1_lg, cw1_lg_diag)
Here, I will outline the general procedure for model selection in two parts: selecting the appropriate probability distribution and selecting variables in your model.
Source: B. Bolker (2008) Ecological Models and Data in R, Princeton University Press, Princeton, New Jersey, USA
We can also select the appropriate probability distribution by analyzing our data. To recap, we ran four models to evaluate whether the mean density of brown hares changes linearly with year. We can use the Akaike Information Criterion (AIC) to rank our candidate models:
\[ AIC = 2k - 2LL \] where k is the number of parameters in the model and LL is the log-likelihood of the model. AIC rewards model fit (higher LL) but penalizes the number of parameters; the lower the AIC, the better the model. Here, we can use the ICtab() function from the bbmle package.
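To see what ICtab() is computing, AIC can be assembled by hand from logLik(). A quick check on built-in data (the model m is illustrative):

```r
# AIC = 2k - 2LL; for lm(), k counts the coefficients plus the residual variance
m <- lm(len ~ dose, data = ToothGrowth)
k <- attr(logLik(m), "df")
2 * k - 2 * as.numeric(logLik(m)) # identical to AIC(m)
AIC(m)
```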
# from testing our assumptions
dens7 <- lm(mean.density ~ yearPost, data = hares)
dens8 <- lm(log(mean.density) ~ yearPost, data = hares)
dens9 <- glm(mean.density ~ yearPost, family = gaussian(link = "log"), data = hares)
dens10 <- glm(mean.density ~ yearPost, family = Gamma(link = "log"), data = hares)
ICtab(dens7, dens8, dens9, dens10, base = T) # log transformation wins
## AIC dAIC df
## dens8 1800.9 0.0 3
## dens10 3391.6 1590.7 3
## dens9 3738.4 1937.5 3
## dens7 3738.5 1937.6 3
# from practice 5
cw1 <- lm(weight ~ Time, data = ChickWeight)
cw1_lg <- glm(weight ~ Time, family = Gamma(link = "log"), data = ChickWeight)
ICtab(cw1, cw1_lg) # gamma wins
## dAIC df
## cw1_lg 0.0 3
## cw1 497.2 3
The common rule of thumb is to choose the most parsimonious model within 2 dAIC of the winning model.
Note: If multiple models fall within 2 dAIC of the winning model, you can choose to do something referred to as “model averaging”. Model averaging is not ideal for models with different probability distributions; it is better suited to averaging multiple winning models, such as a linear and a quadratic function, when the truth falls somewhere between those functions. The bbmle::ICtab() function has an argument for computing IC weights that allows for model averaging. For more information on model averaging, see Burnham and Anderson 2002.
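As a sketch of what those IC weights represent, they can be computed by hand from the AIC values (shown here on built-in data so it runs standalone):

```r
# Akaike weights: w_i = exp(-dAIC_i / 2) / sum(exp(-dAIC / 2))
lin  <- lm(len ~ dose, data = ToothGrowth)
quad <- lm(len ~ dose + I(dose^2), data = ToothGrowth)
aics <- c(lin = AIC(lin), quad = AIC(quad))
d <- aics - min(aics)            # dAIC relative to the winning model
w <- exp(-d / 2) / sum(exp(-d / 2))
round(w, 3)                      # relative support for each model; weights sum to 1
```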
For selecting variables in your model, you can do model selection in two ways: the Akaike Information Criterion (AIC) for competing models or likelihood ratio tests (LRTs) for nested models.
We can also use AIC values to select a winning model based on a set of candidate models with different covariates.
Here, we are using epidemiological data from Vicente et al. 2006, which measured parasite loads in red deer reared on a number of estates in Spain. Observations on red deer were taken at different farms, months, years (0-5 are 2000-2005, 99 is 1999), and sexes (1 = male, 2 = female). For each observation, the length of the animal (LCT, head-body length), the kidney-fat index (KFI), the number of Elaphostrongylus cervi parasites per gram of wet feces (Ecervi), and the presence (0, 1) of tuberculosis were recorded.
Here, we might evaluate whether infection intensity of Elaphostrongylus cervi parasites in red deer correlates better with length of the animal (LCT) or kidney-fat index (KFI).
deer <- read.table(here::here("Session 2", "Data", "Deer.txt"), header = T)
hist(deer$Ecervi)
hist(log(deer$Ecervi))
ec1 <- lm(log(Ecervi + 0.01) ~ KFI, data = deer)
ec2 <- lm(log(Ecervi + 0.01) ~ LCT, data = deer)
ICtab(ec1, ec2) # KFI is better than LCT at explaining parasite load
## dAIC df
## ec1 0.0 3
## ec2 856.4 3
For nested models, we should use “marginal null hypothesis testing” of a saturated model to determine significant effects of nested terms.
For example, we might want to evaluate whether the mass of Taylor’s checkerspot caterpillars is affected by being reared on two host plants (i.e., Indian paintbrush and English plantain) that were either sprayed or not sprayed with herbicide.
herb <- read.csv(here::here("Session 2", "Data", "HerbicideCaterpillars.csv"))
head(herb)
## Species Host Treat Mass
## 1 TCB Castilleja Control 4.53
## 2 TCB Castilleja Control 3.67
## 3 TCB Castilleja Control 5.84
## 4 TCB Castilleja Control 2.00
## 5 TCB Castilleja Control 4.25
## 6 TCB Castilleja Control 5.44
hist(herb$Mass)
sat_mod <- glm(Mass ~ Host + Treat + Host:Treat, family=gaussian, data = herb)
car::Anova(sat_mod)
## Analysis of Deviance Table (Type II tests)
##
## Response: Mass
## LR Chisq Df Pr(>Chisq)
## Host 155.777 1 < 2e-16 ***
## Treat 6.200 1 0.01278 *
## Host:Treat 0.168 1 0.68177
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
add_mod <- glm(Mass ~ Host + Treat, family=gaussian, data = herb)
host_mod <- glm(Mass ~ Host, family=gaussian, data = herb)
treat_mod <- glm(Mass ~ Treat, family=gaussian, data = herb)
lmtest::lrtest(add_mod, sat_mod)
## Likelihood ratio test
##
## Model 1: Mass ~ Host + Treat
## Model 2: Mass ~ Host + Treat + Host:Treat
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 4 -537.33
## 2 5 -537.25 1 0.1705 0.6797
lrtest(host_mod, add_mod) # does treatment improve model fit for the additive model?
## Likelihood ratio test
##
## Model 1: Mass ~ Host
## Model 2: Mass ~ Host + Treat
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 3 -540.44
## 2 4 -537.33 1 6.2162 0.01266 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
lrtest(treat_mod, add_mod) # does host plant improve model fit for the additive model?
## Likelihood ratio test
##
## Model 1: Mass ~ Treat
## Model 2: Mass ~ Host + Treat
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 3 -600.07
## 2 4 -537.33 1 125.48 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
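The lrtest() output can be reproduced by hand: the test statistic is twice the difference in log-likelihoods of the nested models, compared against a chi-square distribution with df equal to the difference in parameter counts. A sketch on built-in data:

```r
# LRT by hand: 2*(LL_full - LL_reduced) ~ chi-square(df difference)
full <- lm(len ~ supp + factor(dose), data = ToothGrowth)
red  <- lm(len ~ factor(dose), data = ToothGrowth)
stat <- 2 * (as.numeric(logLik(full)) - as.numeric(logLik(red)))
pchisq(stat, df = 1, lower.tail = FALSE) # p-value for dropping supp
```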
Generalized linear models (GLMs) connect the mean of the response to its predictors in a linear way through “link functions”; they therefore produce coefficients of a linear relationship on the link-function scale. Instead of transforming data to fit a normal distribution, such as the log transformation:
\[ln(y_{i}) = \beta_{0} + \beta_{1}x_{i} \]
\[\mu_{ln(y)} = \beta_{0} + \beta_{1}x_{1} \]
GLMs allow data to follow alternative error distributions, such as a Poisson error distribution on a natural-log link scale:
\[ln(\mu_{y}) = \beta_{0} + \beta_{1}x_{1} \] where the log of mean \(\mu_{y}\) is a linear function of x.
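Because the model is linear on the log scale, exponentiated coefficients act multiplicatively on the mean. A simulated sketch (data invented purely for illustration):

```r
# Simulate Poisson counts whose log-mean increases linearly with x
set.seed(1)
x <- runif(200)
y <- rpois(200, lambda = exp(1 + 0.5 * x)) # true rate ratio exp(0.5) per unit x
m.pois <- glm(y ~ x, family = poisson)
exp(coef(m.pois)) # multiplicative effects on the mean count
```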
Before explaining the binomial distribution, we should explain the Bernoulli distribution. The Bernoulli distribution describes the probability of a single event (success or failure) in a single trial, whereas the binomial distribution describes the probability of getting k events out of N unordered trials, where each trial is a Bernoulli trial with probability p of an event. Therefore, a binomial distribution with only one trial is a Bernoulli distribution.
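This relationship is easy to verify with R’s built-in density functions (base R only, no data needed):

```r
# A binomial with size = 1 is a Bernoulli trial
dbinom(1, size = 1, prob = 0.3) # probability of one event: 0.3
# k = 2 events out of N = 5 trials
dbinom(2, size = 5, prob = 0.3) # choose(5, 2) * 0.3^2 * 0.7^3 = 0.3087
```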
Below, we will use generalized linear models to estimate a response that is binomially-distributed. A binomial response can either be at the individual-level (i.e., presence/absence where each “0” or “1” is an independent Bernoulli trial) or at the group-level (e.g., k events and N trials).
Individual-level response
For our first example, we will be using data from a laboratory diapause experiment to estimate the probability of entering diapause as a function of photoperiod. Mustard white butterfly (Pieris oleracea) larvae were reared under five photoperiod treatments; I then recorded whether each individual survived (1 for survived, 0 for died) and, conditional on surviving, whether it diapaused (1 for entering diapause, 0 for eclosing directly). These data were used in Kerr et al. 2020.
diap <- read.csv(here::here("Session 2", "Data", "MW_DiapausingExp.csv"))
head(diap)
## Treatment Sex Surv Diapause
## 1 11.5 female 1 1
## 2 11.5 male 1 1
## 3 11.5 male 0 NA
## 4 12.5 female 1 1
## 5 12.5 female 1 1
## 6 12.5 female 1 1
We can run a logit-link GLM to estimate the probability of diapausing conditioned on survival as a function of photoperiod, as follows:
pd1 <- glm(Diapause ~ Treatment, family = binomial, data = subset(diap, Sex == "female"))
Anova(pd1)
## Analysis of Deviance Table (Type II tests)
##
## Response: Diapause
## LR Chisq Df Pr(>Chisq)
## Treatment 104.49 1 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(pd1)
##
## Call:
## glm(formula = Diapause ~ Treatment, family = binomial, data = subset(diap,
## Sex == "female"))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.66078 0.00913 0.08165 0.24268 1.56334
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 60.4946 9.8336 6.152 7.66e-10 ***
## Treatment -4.3834 0.7154 -6.127 8.93e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 216.53 on 211 degrees of freedom
## Residual deviance: 112.04 on 210 degrees of freedom
## (15 observations deleted due to missingness)
## AIC: 116.04
##
## Number of Fisher Scoring iterations: 7
light.vals <- seq(min(diap$Treatment), max(diap$Treatment), length.out = 50)
pd.logit <- coef(pd1)[1] + coef(pd1)[2]*light.vals # logit-link scale estimates
pd.pred <- plogis(coef(pd1)[1] + coef(pd1)[2]*light.vals) # back-transformed estimates
plot(light.vals, pd.logit, type = "l", ylab = "Diapause probability on the logit-link scale", xlab = "Photoperiod")
plot(light.vals, pd.pred, type = "l", ylab = "Diapause probability", xlab = "Photoperiod")
manual.pred <- exp(coef(pd1)[1] + coef(pd1)[2]*light.vals) / (1 + exp(coef(pd1)[1] + coef(pd1)[2]*light.vals))
tail(pd.pred)
## [1] 0.5610065 0.5054010 0.4496616 0.3951582 0.3431419 0.2946374
tail(manual.pred)
## [1] 0.5610065 0.5054010 0.4496616 0.3951582 0.3431419 0.2946374
First, the summary output includes two additional components for a GLM: the dispersion parameter and the null and residual deviances.
To compare your proposed model with the null model, you can calculate the chi-squared statistic (\(\chi^2 = \text{null deviance} - \text{residual deviance}\)) and its degrees of freedom (\(df_{Null} - df_{Proposed}\)).
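This deviance-based test can be computed by hand from the fitted model above (pd1):
chisq <- pd1$null.deviance - pd1$deviance # 216.53 - 112.04 = 104.49
df.diff <- pd1$df.null - pd1$df.residual # 211 - 210 = 1
pchisq(chisq, df = df.diff, lower.tail = FALSE) # matches the Anova(pd1) result above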
Group-level response
For our second example, we will be using data from an overwintering experiment to estimate the overwinter survival of diapausing pupae of the mustard white butterfly. Each bug dorm had a fall count of diapausing pupae (N, trials) and a spring count of the number of emerging butterflies (k, events). In other words, we evaluate the number of successes as the number of emerging butterflies (i.e., k) and the number of failures as the number of non-emerging butterflies (i.e., N - k).
We can run an intercept-only logit-link GLM to estimate the overwinter survival of the mustard white butterfly, as follows:
surv <- read.csv(here::here("Session 2", "Data", "MW_OverwinterSurv.csv"))
head(surv)
## Dorm Fall.count Spring.count
## 1 1 3 1
## 2 3 20 0
## 3 6 8 2
## 4 7 2 0
## 5 11 11 0
## 6 13 11 2
surv$failures <- surv$Fall.count - surv$Spring.count
ws1 <- glm(cbind(Spring.count, Fall.count - Spring.count) ~ 1, family = binomial, data = surv)
summary(ws1)
##
## Call:
## glm(formula = cbind(Spring.count, Fall.count - Spring.count) ~
## 1, family = binomial, data = surv)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3610 -1.0870 -0.6373 0.9063 1.9614
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.900 0.268 -7.089 1.35e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 28.474 on 19 degrees of freedom
## Residual deviance: 28.474 on 19 degrees of freedom
## AIC: 48.119
##
## Number of Fisher Scoring iterations: 5
plogis(coef(ws1)) # 13% probability of surviving the winter
## (Intercept)
## 0.1300813
exp(coef(ws1)) / (1 + exp(coef(ws1)))
## (Intercept)
## 0.1300813
I think the group-level binary response has many potential applications:
For example, I have used it to estimate the trap color of Venus flytraps based on a crude categorization of color along a spectrum from green to dark red.
Figure 2: Trap color of Venus flytraps from four populations (or eight sub-sites) across their range
The colors of traps were one of seven categories ranging from green, green-pink, pink, pink-red, red, red-dark red, and dark red. The total number of trials was 6, where green was 0 successes + 6 failures and dark red was 6 successes + 0 failures.
The following are example group-level binomial responses (six Bernoulli trials each) for three plants:
cbind(0, 6) # green
cbind(3, 3) # pink-red
cbind(6, 0) # dark red
Figure 3. Trap color of Venus flytraps from four populations (or eight sub-sites) across their range
Another example could be estimating the day of spring emergence for a diapausing insect, where the N trials are the 365 days of the year (assuming no leap year) and the k events are the number of days into the year before it emerged. In other words, the successes would be the days from January 1st until emergence and the failures would be the remaining days of the year. The model then estimates the probability (proportion of the year) at which an insect emerges from diapause, and you can multiply that probability by the number of days in a year to convert back to “days until spring emergence”.
This binary “group-level” format has many potential ecological applications.
Practice example 6: Using the deer data, run a logit-link GLM to estimate the probability that a red deer will have the parasite Elaphostrongylus cervi given its kidney fat index (KFI) and sex. Run marginal hypothesis testing to evaluate the interactive effects of these two covariates.
deer$PEcervi <- deer$Ecervi
deer$PEcervi[deer$PEcervi > 0] <- 1
pec1 <- glm(PEcervi ~ KFI*Sex, family = binomial, data = deer)
Anova(pec1)
## Analysis of Deviance Table (Type II tests)
##
## Response: PEcervi
## LR Chisq Df Pr(>Chisq)
## KFI 0.1575 1 0.6914524
## Sex 11.6140 1 0.0006546 ***
## KFI:Sex 4.1990 1 0.0404484 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Here, we will explore running log-link GLMs using the Poisson error distribution. In simple terms, the Poisson distribution is used for discrete count data bound between 0 and \(\infty\).
In more complex terms, the Poisson distribution is like a binomial distribution with a large “unknown” number of trials N and a small probability of an event \(p = \lambda / N\), where \(\lambda\) is the expected number of events (the mean count). The probability of a given number of events can be thought of as the probability of counting that many individuals over a fixed interval of time or space.
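To illustrate this limit, we can compare a binomial with many rare trials to a Poisson with the same mean:
lambda <- 3
N <- 1e5 # a large "unknown" number of trials
dbinom(0:5, size = N, prob = lambda/N) # binomial probabilities with rare events
dpois(0:5, lambda = lambda) # nearly identical Poisson probabilities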
Here, we have counts of wolves in each year from 1982 to 2012 across five US states.
wolves <- read.csv(here::here("Session 2", "Data", "NRMwolves.csv"))
head(wolves)
## year num.wolves MT.wolves WY.wolves ID.wolves OR.wolves WA.wolves
## 1 1982 8 8 NA NA NA NA
## 2 1983 6 6 NA NA NA NA
## 3 1984 6 6 NA NA NA NA
## 4 1985 13 13 NA NA NA NA
## 5 1986 15 15 NA NA NA NA
## 6 1987 10 10 NA NA NA NA
First, we will run a Poisson GLM to evaluate whether the total number of wolves has changed over a 10-year period from 1994-2003.
with(wolves[13:22,], plot(year, num.wolves))
wv10a <- glm(num.wolves ~ year, family = poisson(link = "identity"), data = wolves[13:22,])
coef(wv10a) # year is the change in the # of wolves per year
## (Intercept) year
## -141848.51720 71.15512
wv10b <- glm(num.wolves ~ year, family = poisson(link = "log"), data = wolves[13:22,])
coef(wv10b)
## (Intercept) year
## -486.6604133 0.2463317
exp(coef(wv10b)[2]) # slope is the population growth rate
## year
## 1.279324
wolves$num.wolves[13] # number of wolves in 1994
## [1] 48
exp(coef(wv10b)[2]) * wolves$num.wolves[14] # estimated number of wolves in 1996, projected from the 1995 count
## year
## 129.2117
with(wolves[13:22,], plot(year, num.wolves))
points(1994:2003, coef(wv10a)[1] + coef(wv10a)[2]*1994:2003, type = "l", col = "blue") # linear growth
points(1994:2003, exp(coef(wv10b)[1] + coef(wv10b)[2]*1994:2003), type = "l", col = "red") # unlimited exponential growth
ICtab(wv10a, wv10b)
## dAIC df
## wv10a 0.0 2
## wv10b 3.2 2
Second, we will evaluate whether the total number of wolves has changed over the full 30-year period from 1983 to 2012.
with(wolves[2:31,], plot(year, num.wolves))
wv30a <- glm(num.wolves ~ year, family = poisson(link = "log"), data = wolves[2:31,]) # unlimited exp growth
coef(wv30a)
## (Intercept) year
## -318.2374555 0.1620725
wv30b <- glm(num.wolves ~ year + I(year*year), family = poisson(link = "log"), data = wolves[2:31,]) # quadratic function model
coef(wv30b)
## (Intercept) year I(year * year)
## -3.369141e+04 3.347835e+01 -8.314798e-03
with(wolves[2:31,], plot(year, num.wolves))
points(1983:2012, exp(coef(wv30a)[1] + coef(wv30a)[2]*1983:2012), type = "l", col = "blue")
points(1983:2012, exp(coef(wv30b)[1] + coef(wv30b)[2]*1983:2012 + coef(wv30b)[3]*1983:2012*1983:2012), type = "l", col = "red")
ICtab(wv30a, wv30b) # compare the exponential vs. quadratic fits over the 30-year period
Practice example 7: Using the “Bombus_eggs.rds” R object file, run a Poisson GLM to estimate the number of new eggs laid in bumblebee colonies (Bombus vosnesenskii) as a function of days of the experiment (“Days”). Evaluate whether a model with a linear or quadratic function better fits these data. Then, plot the raw data with model predicted values from both models.
readRDS() # for loading R object files
eggs <- readRDS(here::here("Session 2", "Data", "Bombus_eggs.rds"))
bb_eggs1 <- glm(new.eggs ~ Days + I(Days*Days), family = poisson, data = eggs)
bb_eggs2 <- glm(new.eggs ~ Days, family = poisson, data = eggs)
Anova(bb_eggs1)
## Analysis of Deviance Table (Type II tests)
##
## Response: new.eggs
## LR Chisq Df Pr(>Chisq)
## Days 316.61 1 < 2.2e-16 ***
## I(Days * Days) 409.78 1 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
lrtest(bb_eggs2, bb_eggs1)
## Likelihood ratio test
##
## Model 1: new.eggs ~ Days
## Model 2: new.eggs ~ Days + I(Days * Days)
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 2 -2055.5
## 2 3 -1850.6 1 409.78 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
with(eggs, plot(Days, new.eggs, pch = 19))
points(30:120, exp(coef(bb_eggs1)[1] + coef(bb_eggs1)[2]*30:120 + coef(bb_eggs1)[3]*30:120*30:120), type = "l")
points(30:120, exp(coef(bb_eggs2)[1] + coef(bb_eggs2)[2]*30:120), type = "l", col = "red")
Here, we will cover how to deal with three common issues that you may encounter with Poisson-distributed data:
A common issue in ecology is that observations of count data are not always equally represented in space or time. For example, each observation may be collected over a different length of time or across a different area.
Here, we will use the Audubon Christmas Bird Count 2013 data on Northern Flickers across New England. Since observation periods differ, we want to model our counts as rates (# per hour of observation).
flickers <- read.csv(here::here("Session 2", "Data", "NE_flickers.csv"))
head(flickers)
## Code Name Latitude Longitude hours Count
## 1 CTBA Barkhamsted 41.9123 -72.9884 72.50 0
## 2 CTEW Edwin Way Teale Trail Wood 41.7966 -71.9274 48.50 0
## 3 CTGS Greenwich-Stamford 41.0826 -73.6138 182.50 53
## 4 CTHA Hartford 41.7660 -72.6727 198.00 0
## 5 CTLH Litchfield Hills 41.7703 -73.2724 69.95 0
## 6 CTLS Lakeville-Sharon 41.9449 -73.4399 49.50 0
nf1 <- glm(Count ~ 1, family = poisson, data = flickers)
nf1_offset <- glm(Count ~ 1, offset = log(hours), family = poisson, data = flickers)
ICtab(nf1, nf1_offset)
## dAIC df
## nf1_offset 0.0 1
## nf1 569.1 1
exp(coef(nf1_offset)) # 0.07 birds per hour
## (Intercept)
## 0.07294809
Why “offset = log(hours)” in our model? Well, the Poisson offset can be represented algebraically as follows:
\[ ln(counts/exposure) = \beta_{0} \]
which can be separated out as follows:
\[ ln(counts) - ln(exposure) = \beta_{0} \]
To estimate counts, the formula can be rearranged as follows:
\[ ln(counts) = \beta_{0} + ln(exposure) \]
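For an intercept-only model, this rearrangement implies that the estimated intercept is simply the log of the overall rate (total counts divided by total exposure), which we can confirm with the flickers data and the nf1_offset model fit above:
log(sum(flickers$Count) / sum(flickers$hours)) # same value as coef(nf1_offset)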
We can also use offsets to run a process-error model for evaluating population dynamics, where the count at the current time step is offset by the count at the previous time step. An intercept-only model then estimates a single average growth rate per time step across the series.
wv30a <- glm(num.wolves ~ year, family = poisson(link = "log"), data = wolves[2:31,]) # observation error model
x = 1:30
est.vals <- wolves$num.wolves[2]
for(i in 2:30){
est.vals[i] <- wolves$num.wolves[i-1] * exp(coef(wv30a)[2])^x[i] # simple discrete exp growth, N(t+x) = N(t) * r^x
}
plot(x, est.vals) # exponential growth
wv30b <- glm(num.wolves ~ year + I(year*year),
family = poisson(link = "log"), data = wolves[2:31,]) # quadratic function model
wv30c <- glm(num.wolves[2:31] ~ 1, offset = log(num.wolves[1:30]),
family = poisson(link = "log"), data = wolves) # process error model
exp(coef(wv30c)) # growth rate from each time step
## (Intercept)
## 1.108238
wolves$num.wolves[2:31] * exp(coef(wv30c))
## [1] 6.649428 6.649428 14.407095 16.623571 11.082380 15.515333
## [7] 13.298857 36.571856 32.138903 52.087188 60.953093 53.195426
## [13] 111.932043 168.452183 236.054704 304.765463 373.476221 484.300026
## [19] 623.938020 734.761824 843.369153 937.569387 1130.402807 1440.709459
## [25] 1676.764163 1834.133966 1920.576533 1909.494153 1999.261435 1855.190489
with(wolves[2:31,], plot(year, num.wolves, ylim = c(0, 2000)))
points(1983:2012, exp(coef(wv30a)[1] + coef(wv30a)[2]*1983:2012), type = "l", col = "blue") # observation error
points(1983:2012, exp(coef(wv30b)[1] + coef(wv30b)[2]*1983:2012 + coef(wv30b)[3]*1983:2012*1983:2012), type = "l", col = "red")
points(1983:2012, predict(wv30c, type = "response"), type = "l", col = "green") # process error
ICtab(wv30a, wv30b, wv30c) # quadratic model wins
## dAIC df
## wv30b 0.0 3
## wv30c 86.2 1
## wv30a 1250.3 2
Practice example 8: Similarly, we could model the brown hare counts with a Poisson offset of log(area), instead of modelling log(mean.density) with a normal distribution. Do you find similar estimates of mean density for the two land-use types when using the Poisson distribution compared to the log-transformed data?
dens12 <- glm(count1 ~ 1, offset = log(area), family = poisson, data = hares)
summary(dens12)
##
## Call:
## glm(formula = count1 ~ 1, family = poisson, data = hares, offset = log(area))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -8.348 -3.321 -1.175 2.026 12.752
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.482511 0.007233 205 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 10492 on 665 degrees of freedom
## Residual deviance: 10492 on 665 degrees of freedom
## (286 observations deleted due to missingness)
## AIC: 13628
##
## Number of Fisher Scoring iterations: 5
dens13 <- glm(log(mean.density) ~ 1, family = gaussian, data = hares)
exp(coef(dens12)) # 4.4 brown hares per hectare
## (Intercept)
## 4.403988
exp(coef(dens13)) # 3.5 brown hares per hectare
## (Intercept)
## 3.348811
ICtab(dens12, dens13)
## dAIC df
## dens13 0.0 2
## dens12 11828.9 1
A common issue when fitting Poisson models is overdispersion, i.e., when count data have more variability than expected under a Poisson distribution. Underdispersion (less variability than expected) is less common, but it still occurs in biological data.
To test for dispersion, compare the residual deviance to the residual degrees of freedom: for a well-fitting Poisson model, the ratio of residual deviance to degrees of freedom should be close to 1. If the ratio is greater than 1, your data are more dispersed than the Poisson error distribution permits; if it is less than 1, your data are less dispersed.
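As a sketch, this check can be wrapped in a small helper function (the chi-squared p-value here is a rough guide for overdispersion only, not a formal test):
check_dispersion <- function(model) {
  ratio <- model$deviance / model$df.residual # should be close to 1
  p <- pchisq(model$deviance, df = model$df.residual, lower.tail = FALSE)
  c(ratio = ratio, p.value = p)
}
check_dispersion(nf1_offset) # e.g., for the flicker offset model fit above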
Let’s test whether Northern Flicker counts are overdispersed, using the “intercept-only” model from the Poisson offset example #1.
nf1_offset <- glm(Count ~ 1, offset = log(hours), family = poisson, data = flickers) # model above
summary(nf1_offset)
##
## Call:
## glm(formula = Count ~ 1, family = poisson, data = flickers, offset = log(hours))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -5.375 -3.283 -2.691 -1.757 11.576
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.61801 0.04089 -64.02 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 1567.4 on 101 degrees of freedom
## Residual deviance: 1567.4 on 101 degrees of freedom
## AIC: 1681.8
##
## Number of Fisher Scoring iterations: 6
nf1_offset$deviance/nf1_offset$df.residual # overdispersed
## [1] 15.51898
Let’s test whether hare counts using Poisson offset model was overdispersed.
dens12 <- glm(count1 ~ 1, offset = log(area), family = poisson, data = hares) # model above
summary(dens12)
##
## Call:
## glm(formula = count1 ~ 1, family = poisson, data = hares, offset = log(area))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -8.348 -3.321 -1.175 2.026 12.752
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.482511 0.007233 205 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 10492 on 665 degrees of freedom
## Residual deviance: 10492 on 665 degrees of freedom
## (286 observations deleted due to missingness)
## AIC: 13628
##
## Number of Fisher Scoring iterations: 5
dens12$deviance/dens12$df.residual
## [1] 15.7779
In both cases, these data are overdispersed for the Poisson distribution.
There are multiple ways to deal with overdispersed count data:
- negative binomial distribution (family = nbinom1 or nbinom2)
- Conway-Maxwell-Poisson distribution (family = compois)
Here, we will use the negative binomial distribution, which is the most common approach for overdispersion. A negative binomial distribution is essentially an average of Poisson distributions with different means. Unlike the Poisson distribution, where the mean equals the variance, the variance of the negative binomial can be either a linear (nbinom1) or quadratic (nbinom2) function of the mean. As the mean grows toward \(\infty\), a negative binomial resembles a Poisson distribution, but near the bound of 0 it allows for greater variance relative to its mean.
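For reference, the two negative binomial parameterizations in glmmTMB express the variance in terms of the mean \(\mu\) and a dispersion parameter \(\phi\):
\[Var(y) = \mu(1 + \phi) \]
for nbinom1 (linear), and
\[Var(y) = \mu(1 + \mu/\phi) \]
for nbinom2 (quadratic).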
Let’s re-run the intercept-only Poisson offset model for Northern Flicker counts with a negative binomial distribution, using the glmmTMB() function from the glmmTMB package by Ben Bolker.
nf2 <- glmmTMB(Count ~ 1, offset = log(hours), family = nbinom1, data = flickers)
nf3 <- glmmTMB(Count ~ 1, offset = log(hours), family = nbinom2, data = flickers)
ICtab(nf1_offset, nf2, nf3) # variance is linearly parameterized to its mean
## dAIC df
## nf2 0.0 2
## nf3 9.5 2
## nf1_offset 1362.2 1
summary(nf1_offset)$coefficients # Poisson variance
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.618007 0.04089262 -64.02151 0
summary(nf2)$coefficients$cond # NB allows for greater variance
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.618007 0.3574455 -7.324214 2.403021e-13
Underdispersion is far less common than overdispersion, but it is often encountered in ecology with species that have small clutch/litter sizes, for example, a bird that may only lay up to 6 eggs per clutch.
Here, we have a dataset from the glmmTMB package on the begging behaviour of nestling barn owls that may provide an example of underdispersed count data.
?Owls
Roulin and Bersier (2007) looked at nestlings’ response to the provisioning parent under two food treatments. Using microphones inside and a video camera outside the nests, they studied vocal begging behaviour when the parents brought prey at 27 nests. We use ‘sibling negotiation,’ defined as the number of calls by the nestlings in the 30-second interval immediately prior to the arrival of a parent, divided by the number of nestlings. Data were collected between 21.30 hours and 05.30 hours on two consecutive nights. The variable ArrivalTime indicates the time at which a parent arrived at the perch with prey.
Using these data, we want to estimate the brood size of barn owls and evaluate whether these data are underdispersed.
data(Owls) # load the Owls dataset from glmmTMB
Owls.sum <- Owls %>% group_by(Nest) %>% distinct(BroodSize) # dplyr - %>% pipe
hist(Owls.sum$BroodSize)
bs1 <- glm(BroodSize ~ 1, family = poisson, data = Owls.sum)
summary(bs1) # on the log-link scale
##
## Call:
## glm(formula = BroodSize ~ 1, family = poisson, data = Owls.sum)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.81192 -0.27973 -0.01846 0.46189 1.33404
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.39551 0.09578 14.57 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 12.274 on 26 degrees of freedom
## Residual deviance: 12.274 on 26 degrees of freedom
## AIC: 101.06
##
## Number of Fisher Scoring iterations: 4
exp(coef(bs1)) # back-transformed scale, approximately 4 nestlings per nest
## (Intercept)
## 4.037037
summary(bs1)$deviance/summary(bs1)$df.residual # underdispersed, < 1
## [1] 0.4720701
To address underdispersed count data, the most appropriate distribution would be the Conway-Maxwell-Poisson distribution that adds a parameter to the Poisson distribution to account for either underdispersion or overdispersion.
bs2 <- glmmTMB(BroodSize ~ 1, family = compois, data = Owls.sum)
ICtab(bs1, bs2)
## dAIC df
## bs2 0.0 2
## bs1 5.4 1
A quasi-Poisson model (family = quasipoisson) can also be used for underdispersed count data; it estimates a dispersion parameter rather than fitting a different distribution.
Practice example 9: Using the bumblebee dataset from practice example #7, run a Poisson GLM for new eggs laid as a function of resource treatment and evaluate whether these data are underdispersed or overdispersed for a Poisson distribution. Then, re-run a model with an appropriate probability distribution and evaluate whether the covariate of Treatment improves model fit.
bb_eggs3 <- glm(new.eggs ~ Treatment, family = poisson, data = eggs)
summary(bb_eggs3)$deviance/summary(bb_eggs3)$df.residual
## [1] 14.61368
bb_eggs4 <- glmmTMB(new.eggs ~ Treatment, family = nbinom1, data = eggs)
bb_eggs5 <- glmmTMB(new.eggs ~ Treatment, family = nbinom2, data = eggs)
bb_eggs6 <- glmmTMB(new.eggs ~ Treatment, family = compois, data = eggs)
ICtab(bb_eggs3, bb_eggs4, bb_eggs5, bb_eggs6) # nbinom1
## dAIC df
## bb_eggs4 0.0 4
## bb_eggs5 14.5 4
## bb_eggs6 36.2 4
## bb_eggs3 2273.0 3
Anova(bb_eggs4) # Treatment improves model fit
## Analysis of Deviance Table (Type II Wald chisquare tests)
##
## Response: new.eggs
## Chisq Df Pr(>Chisq)
## Treatment 29.281 2 4.383e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(bb_eggs3)$coefficients
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.1613713 0.02496102 126.65231 0.000000e+00
## TreatmentHigh-low -0.2859252 0.03708003 -7.71103 1.248065e-14
## TreatmentLow -0.8950524 0.04520354 -19.80049 2.948009e-87
summary(bb_eggs4)$coefficients$cond # NB allows more variance, so larger standard errors
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.2061893 0.1138670 28.157306 1.950368e-174
## TreatmentHigh-low -0.3791834 0.1558135 -2.433572 1.495064e-02
## TreatmentLow -0.9551589 0.1767421 -5.404251 6.507980e-08
bb_means <- glmmTMB(new.eggs ~ -1 + Treatment, family = nbinom1, data = eggs)
exp(summary(bb_means)$coefficients$cond[, 1])
## TreatmentHigh TreatmentHigh-low TreatmentLow
## 24.684884 16.894797 9.497514
Here, we will run two different models for when you encounter many zeros in your count data (i.e., zero-inflated count data). We can take two potential approaches: a zero-inflated regression or a hurdle model. The first assumes that not all zeros are “true” zeros and that some are part of the Poisson process. This can commonly occur due to observation error. For example, your count data may be the number of butterflies seen, but not seeing an individual does not guarantee that no individuals were present. For zero-inflated models, the binomial process may determine whether a location is suitable habitat, and the count process may represent the quality of that suitable habitat. However, remember that a count of 0 does not necessarily mean that the habitat is unsuitable.
A zero-inflated Poisson model is appropriate for this process, since not all zeros are true. A zero-inflated model evaluates the nonzero and zero processes together by evaluating the probability that a zero comes from the main “nonzero” count distribution vs. the binomial distribution (i.e., an excess zero).
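A minimal sketch of this mixture for the Poisson case, where p_zero is the probability of an excess (structural) zero:
zip_zero_prob <- function(p_zero, lambda) p_zero + (1 - p_zero) * dpois(0, lambda) # P(y = 0)
dpois(0, 2) # probability of a "true" Poisson zero when lambda = 2
zip_zero_prob(p_zero = 0.3, lambda = 2) # excess zeros inflate P(y = 0) above dpois(0, 2)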
You can implement zero-inflated models using the zeroinfl function in the pscl package.
pscl::zeroinfl(formula = y ~ x_count | x_zero)
However, we will be using the glmmTMB package to run both zero-inflated and hurdle models, since this package has additional probability distributions not in pscl and it also allows for both these models to have mixed-effects, which we will discuss later.
hist(flickers$Count)
ZIP1 <- glmmTMB(Count ~ 1, offset = log(hours), ziformula = ~ 1, family = "nbinom1", data = flickers)
summary(ZIP1)
## Family: nbinom1 ( log )
## Formula: Count ~ 1
## Zero inflation: ~1
## Data: flickers
## Offset: log(hours)
##
## AIC BIC logLik deviance df.resid
## 314.4 322.3 -154.2 308.4 99
##
##
## Overdispersion parameter for nbinom1 family (): 20.7
##
## Conditional model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.6833 0.2069 -8.137 4.06e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Zero-inflation model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.8205 0.2995 2.74 0.00615 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(ZIP1)$coefficients
## $cond
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.683347 0.2068816 -8.136766 4.059764e-16
##
## $zi
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.8205285 0.2994806 2.739839 0.006146936
##
## $disp
## NULL
ZIP2 <- glmmTMB(Count ~ 1, offset = log(hours), ziformula = ~Latitude, family = "nbinom1", data = flickers) # add latitude to the zero-inflation process
lrtest(ZIP1, ZIP2) # significant effect of latitude on zero process
## Likelihood ratio test
##
## Model 1: Count ~ 1
## Model 2: Count ~ 1
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 3 -154.22
## 2 4 -150.09 1 8.2432 0.004091 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(ZIP2)$coefficients
## $cond
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.650126 0.1881545 -8.770061 1.785701e-18
##
## $zi
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -26.3782065 10.0536801 -2.623736 0.008697107
## Latitude 0.6364322 0.2355161 2.702287 0.006886428
##
## $disp
## NULL
ZIP3 <- glmmTMB(Count ~ Latitude, offset = log(hours), ziformula = ~., family = "nbinom1", data = flickers)
lrtest(ZIP2, ZIP3)
## Likelihood ratio test
##
## Model 1: Count ~ 1
## Model 2: Count ~ Latitude
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 4 -150.09
## 2 5 -141.44 1 17.312 3.172e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(ZIP3)$coefficients
## $cond
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 28.2685819 6.8672159 4.116455 3.847452e-05
## Latitude -0.7089281 0.1638307 -4.327199 1.510172e-05
##
## $zi
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -14.2387685 11.8687106 -1.199690 0.2302599
## Latitude 0.3506068 0.2802133 1.251214 0.2108564
##
## $disp
## NULL
For hurdle models, the nonzero and zero processes are modelled separately, and therefore we assume all zeros are true zeros. The glmmTMB package also allows hurdle models for count data via the “truncated” families (e.g., family = truncated_poisson); note that using the untruncated Poisson family with a ziformula would instead specify a zero-inflated model. Here, we are using the owl dataset included in the glmmTMB package.
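A minimal sketch of the zero-truncated Poisson used for the nonzero part, where the probabilities are renormalized over the positive counts:
dtruncpois <- function(k, lambda) dpois(k, lambda) / (1 - dpois(0, lambda)) # P(y = k | y > 0)
sum(dtruncpois(1:50, lambda = 2)) # renormalized probabilities over k > 0 sum to ~1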
table(Owls$SiblingNegotiation)
##
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
## 156 24 43 29 26 26 40 30 22 17 22 16 23 17 27 13 12 12 11 9
## 20 22 23 24 25 26 28 31 32
## 4 5 5 2 1 2 1 2 2
hist(Owls$SiblingNegotiation)
h_mod_pois <- glmmTMB(SiblingNegotiation ~ FoodTreatment, ziformula = ~FoodTreatment,
family = truncated_poisson(link = "log"), data = Owls)
h_mod_nbin1 <- glmmTMB(SiblingNegotiation ~ FoodTreatment, ziformula = ~FoodTreatment,
family = truncated_nbinom1(link = "log"), data = Owls)
h_mod_nbin2 <- glmmTMB(SiblingNegotiation ~ FoodTreatment, ziformula = ~FoodTreatment,
family = truncated_nbinom2(link = "log"), data = Owls)
h_mod_compois <- glmmTMB(SiblingNegotiation ~ FoodTreatment, ziformula = ~FoodTreatment,
family = truncated_compois(link = "log"), data = Owls)
ICtab(h_mod_pois, h_mod_nbin1, h_mod_nbin2, h_mod_compois, base = T) # do we need to account for overdispersion?
## AIC dAIC df
## h_mod_nbin1 3381.1 0.0 5
## h_mod_nbin2 3384.9 3.9 5
## h_mod_compois 3384.9 3.9 5
## h_mod_pois 4172.2 791.2 4
summary(h_mod_nbin1)
## Family: truncated_nbinom1 ( log )
## Formula: SiblingNegotiation ~ FoodTreatment
## Zero inflation: ~FoodTreatment
## Data: Owls
##
## AIC BIC logLik deviance df.resid
## 3381.1 3403.0 -1685.5 3371.1 594
##
##
## Overdispersion parameter for truncated_nbinom1 family (): 3.95
##
## Conditional model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.24752 0.04404 51.04 <2e-16 ***
## FoodTreatmentSatiated -0.19661 0.07691 -2.56 0.0106 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Zero-inflation model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.8101 0.1608 -11.256 < 2e-16 ***
## FoodTreatmentSatiated 1.3957 0.2020 6.908 4.92e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
For the deer data, let’s assume that the method for detecting Elaphostrongylus cervi parasites in red deer is very accurate and that all zeros are true zeros. In other words, there are no potential false negatives (failing to detect the parasite when it is present). Here, let’s evaluate whether the probability of infection and the infection intensity (i.e., the number of parasites per gram of wet feces) depend on sex and the kidney fat index (KFI).
with(deer, hist(Ecervi)) # zero-inflated
deer$PEcervi <- deer$Ecervi
deer$PEcervi[deer$PEcervi > 0] <- 1
h_zero <- glm(PEcervi ~ Sex, family = binomial, data = deer)
Anova(h_zero)
## Analysis of Deviance Table (Type II tests)
##
## Response: PEcervi
## LR Chisq Df Pr(>Chisq)
## Sex 11.151 1 0.0008397 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
h_nzero <- glm(log(Ecervi) ~ KFI, family = gaussian, data = subset(deer, PEcervi == 1))
Anova(h_nzero) # no interaction
## Analysis of Deviance Table (Type II tests)
##
## Response: log(Ecervi)
## LR Chisq Df Pr(>Chisq)
## KFI 9.4023 1 0.002167 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
non_hurdle <- glm(log(Ecervi + 0.01) ~ Sex*KFI, family = gaussian, data = deer)
Anova(non_hurdle)
## Analysis of Deviance Table (Type II tests)
##
## Response: log(Ecervi + 0.01)
## LR Chisq Df Pr(>Chisq)
## Sex 16.4605 1 4.967e-05 ***
## KFI 0.1323 1 0.7160
## Sex:KFI 4.1884 1 0.0407 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Is the hurdle model better?
AIC(non_hurdle)
## [1] 3665.097
2*(length(coef(h_zero)) + length(coef(h_nzero))) - 2*(logLik(h_zero) + logLik(h_nzero)) # AIC = 2k - 2LL
## 'log Lik.' 2647.028 (df=2)
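This sum-of-parts arithmetic comes up whenever you hand-build a hurdle model from two separate fits, so it can be wrapped in a small helper. This is a sketch (the function name `combined_AIC` is hypothetical); note that for Gaussian parts, `coef()` omits the residual-variance parameter, just as the manual calculation above does:

```r
# Hypothetical helper: combined AIC for a hand-built two-part (hurdle) model.
# The two parts model disjoint aspects of the data, so their log-likelihoods add.
combined_AIC <- function(...) {
  parts <- list(...)
  k  <- sum(sapply(parts, function(m) length(coef(m))))       # total parameter count
  ll <- sum(sapply(parts, function(m) as.numeric(logLik(m)))) # total log-likelihood
  2 * k - 2 * ll                                              # AIC = 2k - 2LL
}
# combined_AIC(h_zero, h_nzero) reproduces the value computed above
```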
Practice example 10: The Eastern Bluebird data were also collected across New England from the Audubon Christmas Bird Count. Similar to the Northern Flickers, Eastern Bluebirds have breeding grounds in the northern half of the United States, a year-round range from Massachusetts down to Texas, and exclusive wintering grounds in Texas and northern Mexico.
bluebirds <- read.csv(here::here("Session 2", "Data", "bluebirds.csv"))
head(bluebirds)
## Code Latitude Longitude hours count
## 1 CTBA 41.9123 -72.9884 72.50 127
## 2 CTEW 41.7966 -71.9274 48.50 72
## 3 CTGS 41.0826 -73.6138 182.50 60
## 4 CTHA 41.7660 -72.6727 198.00 31
## 5 CTLH 41.7703 -73.2724 69.95 122
## 6 CTLS 41.9449 -73.4399 49.50 65
Using the bluebirds data, choose whether a zero-inflated model, hurdle model, or GLM of bird counts with an offset of observation hours is more appropriate for these data. Include a covariate of latitude on the zero (i.e., probability that the population has migrated) process of the models.
with(bluebirds, hist(count/hours))
bb_zip <- glmmTMB(count ~ 1, offset = log(hours), ziformula = ~Latitude, family = poisson, data = bluebirds)
summary(bb_zip) # overdispersed
## Family: poisson ( log )
## Formula: count ~ 1
## Zero inflation: ~Latitude
## Data: bluebirds
## Offset: log(hours)
##
## AIC BIC logLik deviance df.resid
## 3381.2 3389.1 -1687.6 3375.2 99
##
##
## Conditional model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.36649 0.01482 -24.72 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Zero-inflation model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -67.1276 13.8054 -4.862 1.16e-06 ***
## Latitude 1.5267 0.3156 4.837 1.32e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
bb_zinb <- glmmTMB(count ~ 1, offset = log(hours), ziformula = ~Latitude, family = nbinom1, data = bluebirds)
bb_nzi <- glmmTMB(count ~ Latitude, offset = log(hours), family = nbinom1, data = bluebirds)
ICtab(bb_zinb, bb_nzi) # the zero-inflated model is strongly preferred (dAIC > 20)
## dAIC df
## bb_zinb 0.0 4
## bb_nzi 21.8 3
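To interpret the zero-inflation side of these models, we can back-transform the logit-scale estimates. As a rough illustration using the bb_zip coefficients reported above (the helper name `p_zero` is made up for this example):

```r
# Probability of a structural (excess) zero as a function of latitude,
# using the zero-inflation coefficients reported in the bb_zip summary:
p_zero <- function(lat) plogis(-67.1276 + 1.5267 * lat)
p_zero(42) # ~0.05: near latitude 42, excess zeros are rare
p_zero(45) # ~0.83: at higher latitudes, most counts are structural zeros
```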
Here, we evaluated models with only fixed or non-random effects (i.e., fixed group means). Fixed effects are generally the covariates of interest. Now we will run models with both fixed and random effects, called "mixed models" or "mixed-effects models". These are a form of hierarchical model in which some parameters are no longer assumed to be independent; instead, they come from a second distribution whose mean and variance are "hyperparameters". For example, a simple fixed-effects ANOVA evaluating group means can be written as follows:
\[ y_{i} = \alpha_{j(i)} + \epsilon_{i} \]
where \(\alpha_{j(i)}\) is the mean of response variable \(y_{i}\) for population j and \(\epsilon_{i}\) is the random deviance of i from its population mean \(\alpha_{j(i)}\). The random deviance of \(y_{i}\) from each group mean comes from a normal distribution:
\[ \epsilon_{i} \sim Normal (0, \sigma^2) \]
where the mean is 0 and the variance is \(\sigma^2\) (remember that the standard deviation \(\sigma\) is the square root of the variance).
For a random-effects ANOVA, the \(\alpha_{j(i)}\) parameters come from a second distribution with some mean \(\mu\) and variance \(\tau^2\):
\[ \alpha_{j(i)} \sim Normal(\mu, \tau^2) \]
NOTE: a random-effects model should rarely contain fewer than five populations/groups, since estimating a variance from fewer than five levels will not yield a very precise estimate.
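To make the hierarchy concrete, the random-effects ANOVA above can be simulated directly (all parameter values here are made up for illustration):

```r
# A minimal simulation of the random-effects ANOVA above (illustrative values):
set.seed(1)
mu <- 10; tau <- 2       # hyperparameters: mean and SD of the group means
sigma <- 1               # within-group (residual) SD
n_groups <- 8; n_per <- 20
alpha <- rnorm(n_groups, mean = mu, sd = tau)  # alpha_j ~ Normal(mu, tau^2)
group <- rep(seq_len(n_groups), each = n_per)
y <- rnorm(n_groups * n_per, mean = alpha[group], sd = sigma)  # y_i = alpha_j(i) + eps_i
```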
A common way to distinguish fixed from random effects is that fixed effects are the subjects of interest, whereas random effects are "latent" grouping variables that add variance to your model estimates but are not themselves of interest.
Kéry 2010 (Ch. 12) outlines four potential random-coefficient models: (1) a random-intercept model, (2) a random-slope model, (3) a random-intercept and -slope model without a correlation between the coefficients, and (4) a random-intercept and -slope model with a correlation between the coefficients.
Below, we will run all four combinations of random-coefficient models on the Salamanders dataset (built into glmmTMB) using lme4::glmer(). The Salamanders dataset contains counts of salamanders at sites that were either affected or unaffected by mountaintop coal mining ("mined"). The potential covariates of interest affecting counts are species of salamander ("spp"), cover objects in the stream ("cover"), days since precipitation ("DOP"), water temperature ("Wtemp"), and day of year ("DOY").
attach(Salamanders)
head(Salamanders)
## site mined cover sample DOP Wtemp DOY spp count
## 1 VF-1 yes -1.4423172 1 -0.5956834 -1.22937861 -1.497003 GP 0
## 2 VF-2 yes 0.2984104 1 -0.5956834 0.08476529 -1.497003 GP 0
## 3 VF-3 yes 0.3978806 1 -1.1913668 1.01417627 -1.294467 GP 0
## 4 R-1 no -0.4476157 1 0.0000000 -3.02335795 -2.712216 GP 2
## 5 R-2 no 0.5968209 1 0.5956834 -0.14434533 -0.686860 GP 2
## 6 R-3 no 1.3428470 1 0.5956834 -0.01466007 -0.686860 GP 1
Our models will contain fixed effects of day of year (DOY) on the presence of salamanders. First, we will run a binomial GLM without any random effects.
Salamanders$Presence <- Salamanders$count
Salamanders$Presence[Salamanders$Presence > 1] <- 1
sm0 <- glm(Presence ~ DOY, family = binomial, data = Salamanders)
Anova(sm0)
## Analysis of Deviance Table (Type II tests)
##
## Response: Presence
## LR Chisq Df Pr(>Chisq)
## DOY 3.6403 1 0.0564 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
First, we will run a random-intercept model (probably the most commonly used random-coefficient model), which allows the intercept to come from a second distribution whose mean and variance are estimated from the species-specific intercepts.
sm1 <- glmer(Presence ~ DOY + (1 | spp), family = binomial, data = Salamanders)
summary(sm1)
## Generalized linear mixed model fit by maximum likelihood (Laplace
## Approximation) [glmerMod]
## Family: binomial ( logit )
## Formula: Presence ~ DOY + (1 | spp)
## Data: Salamanders
##
## AIC BIC logLik deviance df.resid
## 835.8 849.2 -414.9 829.8 641
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -1.1802 -0.9066 -0.5072 1.0167 2.3645
##
## Random effects:
## Groups Name Variance Std.Dev.
## spp (Intercept) 0.3691 0.6076
## Number of obs: 644, groups: spp, 7
##
## Fixed effects:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.45926 0.24503 -1.874 0.0609 .
## DOY -0.16646 0.08433 -1.974 0.0484 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr)
## DOY 0.011
ranef(sm1)$spp
## (Intercept)
## GP 0.09401095
## PR -1.04578323
## DM 0.37141759
## EC-A -0.65876778
## EC-L 0.37141759
## DES-L 0.52802873
## DF 0.37141759
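Note that ranef() returns each species' deviation from the population-level (fixed) intercept, not the species-specific intercept itself. To get the combined values, use coef():

```r
# coef() = fixef() + ranef(), so the species-specific intercepts are:
head(coef(sm1)$spp)
# e.g., for GP: -0.45926 + 0.09401 = -0.36525
```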
# Plotting each species' intercept (ranef() returns deviations, so add the fixed intercept)
DOY.vals <- seq(min(Salamanders$DOY), max(Salamanders$DOY), length.out = 100)
re_list1 <- list()
for(i in 1:length(unique(Salamanders$spp))){
re_list1[[i]] <- data.frame(spp = rep(unique(Salamanders$spp)[i], 100), DOY = DOY.vals,
int = rep(fixef(sm1)[1] + ranef(sm1)$spp[i,1], 100),
slope = rep(fixef(sm1)[2], 100))
}
re_dat1 <- do.call("rbind", re_list1)
re_dat1$pred <- plogis(re_dat1$int + re_dat1$slope*re_dat1$DOY) # y = B0 + B1x
re_mod1 <- data.frame(DOY = DOY.vals, pred = plogis(fixef(sm1)[1] + fixef(sm1)[2]*DOY.vals))
ggplot(re_dat1, aes(x = DOY, y = pred, col = spp)) + geom_line() +
geom_line(re_mod1, mapping = aes(x = DOY, y = pred, col = NA), col = "black", size = 2) +
scale_y_continuous(limits = c(0, 1))
Second, we will run a random-slope model (note that this is often not a sensible model on its own), which allows only the slope to come from a second distribution whose mean and variance are estimated from the species-specific slopes.
sm2 <- glmer(Presence ~ DOY + (0 + DOY | spp), family = binomial, data = Salamanders)
summary(sm2)
## Generalized linear mixed model fit by maximum likelihood (Laplace
## Approximation) [glmerMod]
## Family: binomial ( logit )
## Formula: Presence ~ DOY + (0 + DOY | spp)
## Data: Salamanders
##
## AIC BIC logLik deviance df.resid
## 867.7 881.1 -430.8 861.7 641
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -1.0880 -0.8211 -0.7503 1.2000 1.4875
##
## Random effects:
## Groups Name Variance Std.Dev.
## spp DOY 0.03113 0.1764
## Number of obs: 644, groups: spp, 7
##
## Fixed effects:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.4146 0.0811 -5.113 3.18e-07 ***
## DOY -0.1570 0.1057 -1.486 0.137
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr)
## DOY 0.024
ranef(sm2)$spp
## DOY
## GP 0.044387593
## PR 0.044354882
## DM 0.143815374
## EC-A -0.232693717
## EC-L -0.005089487
## DES-L -0.058581025
## DF 0.067178689
# Plotting each species' slope (ranef() returns deviations, so add the fixed slope)
re_list2 <- list()
for(i in 1:length(unique(Salamanders$spp))){
re_list2[[i]] <- data.frame(spp = rep(unique(Salamanders$spp)[i], 100), DOY = DOY.vals,
int = rep(fixef(sm2)[1], 100),
slope = rep(fixef(sm2)[2] + ranef(sm2)$spp[i,1], 100))
}
re_dat2 <- do.call("rbind", re_list2)
re_dat2$pred <- plogis(re_dat2$int + re_dat2$slope*re_dat2$DOY)
re_mod2 <- data.frame(DOY = DOY.vals, pred = plogis(fixef(sm2)[1] + fixef(sm2)[2]*DOY.vals))
ggplot(re_dat2, aes(x = DOY, y = pred, col = spp)) + geom_line() +
geom_line(re_mod2, mapping = aes(x = DOY, y = pred, col = NA), col = "black", size = 2) +
scale_y_continuous(limits = c(0, 1))
Third, we will run a random-intercept and -slope model, which allows both the intercept and the slope to come from second distributions whose means and variances are estimated from the species-specific intercepts and slopes (here, with no correlation between the two).
sm3 <- glmer(Presence ~ DOY + (1 | spp) + (0 + DOY | spp), family = binomial, data = Salamanders)
summary(sm3)
## Generalized linear mixed model fit by maximum likelihood (Laplace
## Approximation) [glmerMod]
## Family: binomial ( logit )
## Formula: Presence ~ DOY + (1 | spp) + (0 + DOY | spp)
## Data: Salamanders
##
## AIC BIC logLik deviance df.resid
## 835.6 853.5 -413.8 827.6 640
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -1.2584 -0.9054 -0.4650 1.0375 2.3975
##
## Random effects:
## Groups Name Variance Std.Dev.
## spp (Intercept) 0.39184 0.6260
## spp.1 DOY 0.06065 0.2463
## Number of obs: 644, groups: spp, 7
##
## Fixed effects:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.4724 0.2521 -1.874 0.061 .
## DOY -0.1840 0.1275 -1.443 0.149
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr)
## DOY 0.018
ranef(sm3)$spp
## (Intercept) DOY
## GP 0.1076772 0.07848642
## PR -1.0423741 0.04858182
## DM 0.3860041 0.21715918
## EC-A -0.7159298 -0.39153518
## EC-L 0.3854927 0.01204268
## DES-L 0.5436764 -0.06212241
## DF 0.3859276 0.11170560
# Plotting each species' intercept and slope (adding the fixed effects to the deviations)
re_list3 <- list()
for(i in 1:length(unique(Salamanders$spp))){
re_list3[[i]] <- data.frame(spp = rep(unique(Salamanders$spp)[i], 100), DOY = DOY.vals,
int = rep(fixef(sm3)[1] + ranef(sm3)$spp[i,1], 100),
slope = rep(fixef(sm3)[2] + ranef(sm3)$spp[i,2], 100))
}
re_dat3 <- do.call("rbind", re_list3)
re_dat3$pred <- plogis(re_dat3$int + re_dat3$slope*re_dat3$DOY)
re_mod3 <- data.frame(DOY = DOY.vals, pred = plogis(fixef(sm3)[1] + fixef(sm3)[2]*DOY.vals))
ggplot(re_dat3, aes(x = DOY, y = pred, col = spp)) + geom_line() +
geom_line(re_mod3, mapping = aes(x = DOY, y = pred, col = NA), col = "black", size = 2) +
scale_y_continuous(limits = c(0, 1))
Finally, we will run a random-intercept and -slope model with a correlation between the coefficients, which allows both the intercept and the slope to come from a joint distribution whose means, variances, and correlation are estimated from the species-specific intercepts and slopes.
sm4 <- glmer(Presence ~ DOY + (1 + DOY | spp), family = binomial, data = Salamanders)
summary(sm4)
## Generalized linear mixed model fit by maximum likelihood (Laplace
## Approximation) [glmerMod]
## Family: binomial ( logit )
## Formula: Presence ~ DOY + (1 + DOY | spp)
## Data: Salamanders
##
## AIC BIC logLik deviance df.resid
## 835.5 857.8 -412.7 825.5 639
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -1.1351 -0.9185 -0.4679 1.0250 2.6872
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## spp (Intercept) 0.40819 0.6389
## DOY 0.07042 0.2654 0.76
## Number of obs: 644, groups: spp, 7
##
## Fixed effects:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.4815 0.2570 -1.874 0.061 .
## DOY -0.2058 0.1337 -1.539 0.124
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr)
## DOY 0.567
ranef(sm4)$spp
## (Intercept) DOY
## GP 0.1360998 0.08876791
## PR -0.9893844 -0.18413320
## DM 0.4395231 0.24221032
## EC-A -0.8342862 -0.43967336
## EC-L 0.3821675 0.09006816
## DES-L 0.5113961 0.06423783
## DF 0.4101393 0.16394280
# Plotting each species' intercept and slope (adding the fixed effects to the deviations)
re_list4 <- list()
for(i in 1:length(unique(Salamanders$spp))){
re_list4[[i]] <- data.frame(spp = rep(unique(Salamanders$spp)[i], 100),
DOY = DOY.vals,
int = rep(fixef(sm4)[1] + ranef(sm4)$spp[i,1], 100),
slope = rep(fixef(sm4)[2] + ranef(sm4)$spp[i,2], 100))
}
re_dat4 <- do.call("rbind", re_list4)
re_dat4$pred <- plogis(re_dat4$int + re_dat4$slope*re_dat4$DOY)
re_mod4 <- data.frame(DOY = DOY.vals, pred = plogis(fixef(sm4)[1] + fixef(sm4)[2]*DOY.vals))
ggplot(re_dat4, aes(x = DOY, y = pred, col = spp)) + geom_line() +
geom_line(re_mod4, mapping = aes(x = DOY, y = pred, col = NA), col = "black", size = 2) +
scale_y_continuous(limits = c(0, 1))
To evaluate which random-coefficient model wins, we can use AIC values to rank these models.
ICtab(sm0, sm1, sm2, sm3, sm4, base = T) # sm1 (random intercept) is the simplest model within 2 dAIC of the top model
## AIC dAIC df
## sm4 835.5 0.0 5
## sm3 835.6 0.1 4
## sm1 835.8 0.3 3
## sm0 866.7 31.2 2
## sm2 867.7 32.2 3
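For nested models, a likelihood ratio test is an alternative to AIC. One caveat: testing whether a variance component equals zero is a boundary case, so the naive chi-square p-value is conservative. A sketch of one way to run it:

```r
# Likelihood ratio test of the random intercept: compare the GLMM (sm1)
# against the GLM without random effects (sm0)
anova(sm1, sm0)
```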
Practice example 11: For this practice example, we will be using the built-in dataset on grouse ticks from the lme4 package (see ?grouseticks). This dataset contains counts of ticks (TICKS) on the heads of red grouse chicks sampled in the field from different broods (BROOD).
attach(grouseticks)
head(grouseticks)
## INDEX TICKS BROOD HEIGHT YEAR LOCATION cHEIGHT
## 1 1 0 501 465 95 32 2.759305
## 2 2 0 501 465 95 32 2.759305
## 3 3 0 502 472 95 36 9.759305
## 4 4 0 503 475 95 37 12.759305
## 5 5 0 503 475 95 37 12.759305
## 6 6 3 503 475 95 37 12.759305
We want you to evaluate the effect of height above sea level (using "cHEIGHT") on the number of ticks on red grouse chicks ("TICKS"), conditioned on chicks having ticks, using random-intercept models (e.g., for year or brood). Since this is a conditional vital rate, we assume that all ticks are detectable and that these data can be subsetted to values greater than 0. We expect you to evaluate the correct error distribution for these data and then find the winning model using model selection.
with(subset(grouseticks, TICKS > 0), hist(TICKS))
gt_pois <- glmmTMB(TICKS ~ cHEIGHT, family = truncated_poisson, data = subset(grouseticks, TICKS > 0))
summary(gt_pois) # overdispersed
## Family: truncated_poisson ( log )
## Formula: TICKS ~ cHEIGHT
## Data: subset(grouseticks, TICKS > 0)
##
## AIC BIC logLik deviance df.resid
## 4283.7 4290.9 -2139.8 4279.7 275
##
##
## Conditional model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.935787 0.027328 70.83 <2e-16 ***
## cHEIGHT -0.016178 0.000741 -21.83 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
gt_nb <- glmmTMB(TICKS ~ cHEIGHT, family = truncated_nbinom1, data = subset(grouseticks, TICKS > 0))
gt_nb_ri <- glmmTMB(TICKS ~ cHEIGHT + (1 | YEAR),
family = truncated_nbinom1, data = subset(grouseticks, TICKS > 0))
gt_nb_ris <- glmmTMB(TICKS ~ cHEIGHT + (1 | BROOD) + (0 + cHEIGHT | BROOD),
family = truncated_nbinom1, data = subset(grouseticks, TICKS > 0))
## Warning in (function (start, objective, gradient = NULL, hessian = NULL, : NA/
## NaN function evaluation
## Warning in (function (start, objective, gradient = NULL, hessian = NULL, : NA/
## NaN function evaluation
## Warning in (function (start, objective, gradient = NULL, hessian = NULL, : NA/
## NaN function evaluation
## Warning in (function (start, objective, gradient = NULL, hessian = NULL, : NA/
## NaN function evaluation
## Warning in (function (start, objective, gradient = NULL, hessian = NULL, : NA/
## NaN function evaluation
## Warning in (function (start, objective, gradient = NULL, hessian = NULL, : NA/
## NaN function evaluation
## Warning in fitTMB(TMBStruc): Model convergence problem; non-positive-definite
## Hessian matrix. See vignette('troubleshooting')
## Warning in fitTMB(TMBStruc): Model convergence problem; false convergence (8).
## See vignette('troubleshooting')
gt_nb_rb <- glmmTMB(TICKS ~ cHEIGHT + (cHEIGHT | BROOD),
family = truncated_nbinom1, data = subset(grouseticks, TICKS > 0))
## Warning in (function (start, objective, gradient = NULL, hessian = NULL, : NA/
## NaN function evaluation
## Warning in (function (start, objective, gradient = NULL, hessian = NULL, : NA/
## NaN function evaluation
## Warning in (function (start, objective, gradient = NULL, hessian = NULL, : NA/
## NaN function evaluation
## Warning in (function (start, objective, gradient = NULL, hessian = NULL, : NA/
## NaN function evaluation
## Warning in (function (start, objective, gradient = NULL, hessian = NULL, : NA/
## NaN function evaluation
## Warning in (function (start, objective, gradient = NULL, hessian = NULL, : NA/
## NaN function evaluation
## Warning in (function (start, objective, gradient = NULL, hessian = NULL, : NA/
## NaN function evaluation
## Warning in (function (start, objective, gradient = NULL, hessian = NULL, : NA/
## NaN function evaluation
## Warning in (function (start, objective, gradient = NULL, hessian = NULL, : NA/
## NaN function evaluation
## Warning in fitTMB(TMBStruc): Model convergence problem; non-positive-definite
## Hessian matrix. See vignette('troubleshooting')
## Warning in fitTMB(TMBStruc): Model convergence problem; false convergence (8).
## See vignette('troubleshooting')
ICtab(gt_nb, gt_nb_ri, gt_nb_ris, gt_nb_rb) # NA values come from the fits that failed to converge, so those models cannot be compared
## dAIC df
## gt_nb_ri 0 4
## gt_nb 30 3
## gt_nb_ris NA 5
## gt_nb_rb NA 6
Nested random effects arise when each unit of one grouping factor is contained entirely within a single unit of another grouping factor. The classic example is students nested within classrooms: each student belongs to exactly one classroom.
In our case, each of the 56 sites is nested within one of the eight regions of Switzerland that were censused for brown hares from 1992 to 2008. Below, we will run a nested random-intercept model.
dens_re1 <- lmer(log(mean.density) ~ yearPost + (1 | region/site), data = hares)
summary(dens_re1)
## Linear mixed model fit by REML ['lmerMod']
## Formula: log(mean.density) ~ yearPost + (1 | region/site)
## Data: hares
##
## REML criterion at convergence: 1123.8
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -4.1577 -0.5086 0.0309 0.6289 3.5523
##
## Random effects:
## Groups Name Variance Std.Dev.
## site:region (Intercept) 0.59981 0.7745
## region (Intercept) 0.06682 0.2585
## Residual 0.22715 0.4766
## Number of obs: 677, groups: site:region, 56; region, 8
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 1.167050 0.149353 7.814
## yearPost -0.005867 0.004152 -1.413
##
## Correlation of Fixed Effects:
## (Intr)
## yearPost -0.254
ranef(dens_re1)
## $`site:region`
## (Intercept)
## AG01:CH.Central -0.05842063
## AG02:CH.Central 0.08364256
## AG03:CH.Central 0.16960042
## AG04:CH.Central -1.19459514
## BE01:Aare -0.26042731
## BE02:Aare -0.33709338
## BE03:Aare -1.07550250
## BE04:Aare -0.17830184
## BE05:Aare -0.74192031
## BE06:Aare 0.14157600
## BE08:CH.W -0.26247600
## BE09:CH.W 0.75464018
## BE12:CH.W 1.62986620
## BE19:Aare 0.57287272
## BE20:Aare 0.62564613
## BE23:Aare -0.91888334
## BE24:Aare 0.22038043
## BE25:Aare 0.30249139
## BE27:Aare -0.09835289
## BE28:Aare -0.66850156
## BL02:Baselland 0.06762409
## BL04:Baselland 0.57641701
## BL05:Baselland -1.42193756
## BR02:CH.W -0.52054755
## BR03:CH.W -1.47531895
## BR04:CH.W -0.42467699
## GE01:CH.SW 0.97195119
## GE02:CH.SW 0.75035374
## GE03:CH.SW 1.02575712
## LI04:CH.E -0.67315032
## LI05:CH.E -1.08869201
## LI15:CH.E -0.37780363
## LU01:CH.Central 0.08389330
## LU07A:CH.Central -0.23928683
## LU07B:CH.Central -0.21652982
## SG06:CH.E 0.58641192
## SG07:CH.E 1.04520584
## SG08:CH.E 0.09202055
## SG09:CH.E 0.40455499
## SH03:CH.N 0.20469902
## SH04:CH.N 0.64147779
## SH05:CH.N -0.71522068
## SH07:CH.N 0.55965618
## SH1_2:CH.N 0.27298231
## SO01:Aare 0.94005670
## SO02:Aare 1.20524487
## SO03:Aare 0.66010019
## TG06B:CH.E 0.17929602
## TG07:CH.E -1.06998388
## VD01:CH.SW 0.07485527
## VD02:CH.W -0.49424157
## VD04:CH.W 0.28888268
## VD05:Rhone 0.21580685
## VS02:Rhone 0.69417389
## VS04:Rhone -1.99148371
## ZH06:CH.N 0.46121089
##
## $region
## (Intercept)
## Aare 0.04337776
## Baselland -0.08665815
## CH.Central -0.15280781
## CH.E -0.10049902
## CH.N 0.15872422
## CH.SW 0.31447475
## CH.W -0.05613165
## Rhone -0.12048011
##
## with conditional variances for "site:region" "region"
dens_re2 <- lmer(log(mean.density) ~ yearPost + (1 | region) + (1|region:site), data = hares) # same model
ranef(dens_re2)
## $`region:site`
## (Intercept)
## Aare:BE01 -0.26042731
## Aare:BE02 -0.33709338
## Aare:BE03 -1.07550250
## Aare:BE04 -0.17830184
## Aare:BE05 -0.74192031
## Aare:BE06 0.14157600
## Aare:BE19 0.57287272
## Aare:BE20 0.62564613
## Aare:BE23 -0.91888334
## Aare:BE24 0.22038043
## Aare:BE25 0.30249139
## Aare:BE27 -0.09835289
## Aare:BE28 -0.66850156
## Aare:SO01 0.94005670
## Aare:SO02 1.20524487
## Aare:SO03 0.66010019
## Baselland:BL02 0.06762409
## Baselland:BL04 0.57641701
## Baselland:BL05 -1.42193756
## CH.Central:AG01 -0.05842063
## CH.Central:AG02 0.08364256
## CH.Central:AG03 0.16960042
## CH.Central:AG04 -1.19459514
## CH.Central:LU01 0.08389330
## CH.Central:LU07A -0.23928683
## CH.Central:LU07B -0.21652982
## CH.E:LI04 -0.67315032
## CH.E:LI05 -1.08869201
## CH.E:LI15 -0.37780363
## CH.E:SG06 0.58641192
## CH.E:SG07 1.04520584
## CH.E:SG08 0.09202055
## CH.E:SG09 0.40455499
## CH.E:TG06B 0.17929602
## CH.E:TG07 -1.06998388
## CH.N:SH03 0.20469902
## CH.N:SH04 0.64147779
## CH.N:SH05 -0.71522068
## CH.N:SH07 0.55965618
## CH.N:SH1_2 0.27298231
## CH.N:ZH06 0.46121089
## CH.SW:GE01 0.97195119
## CH.SW:GE02 0.75035374
## CH.SW:GE03 1.02575712
## CH.SW:VD01 0.07485527
## CH.W:BE08 -0.26247600
## CH.W:BE09 0.75464018
## CH.W:BE12 1.62986620
## CH.W:BR02 -0.52054755
## CH.W:BR03 -1.47531895
## CH.W:BR04 -0.42467699
## CH.W:VD02 -0.49424157
## CH.W:VD04 0.28888268
## Rhone:VD05 0.21580685
## Rhone:VS02 0.69417389
## Rhone:VS04 -1.99148371
##
## $region
## (Intercept)
## Aare 0.04337776
## Baselland -0.08665815
## CH.Central -0.15280781
## CH.E -0.10049902
## CH.N 0.15872422
## CH.SW 0.31447475
## CH.W -0.05613165
## Rhone -0.12048011
##
## with conditional variances for "region:site" "region"
dens_re3 <- lmer(log(mean.density) ~ yearPost + (1 | region), data = hares) # simpler model that drops the site-level intercepts
ICtab(dens_re2, dens_re3)
## dAIC df
## dens_re2 0.0 5
## dens_re3 572.9 4
Practice example 12: For the grasshopper dataset used in example #2, re-run the two-way ANOVA with the interaction of Origin and Treatment on male grasshopper song, adding nested random-intercept effects of site and individual. Using AIC, compare this model against a model with no random effects and a model with only a random-intercept effect of Site to find the winning model.
gh1 <- glmmTMB(LocMax ~ Origin*Treatment, family = gaussian, data = ghp)
gh2 <- glmmTMB(LocMax ~ Origin*Treatment + (1 | Site/Individual), family = gaussian, data = ghp)
gh3 <- glmmTMB(LocMax ~ Origin*Treatment + (1 | Site), family = gaussian, data = ghp)
ICtab(gh1, gh2, gh3)
## dAIC df
## gh2 0.0 7
## gh3 534.0 6
## gh1 547.7 5
What does this winning model suggest about the variance in male grasshopper songs across sites and individuals?
Here is a very brief introduction to simulating data from three types of distributions: the normal, binomial, and Poisson distributions.
I will introduce the density function (e.g., dnorm()), distribution function (e.g., pnorm()), quantile function (e.g., qnorm()) and random generation function (e.g., rnorm()) for these distributions.
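These four function families are consistent with one another, which makes for a handy sanity check:

```r
# q* inverts p*: the CDF evaluated at the 90th percentile returns 0.90
pnorm(qnorm(0.9))                # 0.9
# p* accumulates d*: integrating the density up to 0 gives pnorm(0) = 0.5
integrate(dnorm, -Inf, 0)$value  # 0.5
# r* draws approximate p*: the empirical CDF of random draws matches pnorm
set.seed(1)
mean(rnorm(1e5) <= 0)            # ~0.5
```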
First, we will sample from the normal distribution.
set.seed(1)
n <- 100000 # sample size
mu <- 700 # mean body mass of male peregrines
sd <- 100 # SD of body mass
sample <- rnorm(n = n, mean = mu, sd = sd)
head(sample) # vector of randomly generated numbers
## [1] 637.3546 718.3643 616.4371 859.5281 732.9508 617.9532
par(mfrow = c(1,1))
hist(sample, col = "grey")
plot(0:1400, dnorm(x = 0:1400, mean = mu, sd = sd), type = "l",
 ylab = "Probability density", xlab = "y value")
sum(dnorm(x = 0:1400, mean = mu, sd = sd)) # densities at unit steps sum to ~1
## [1] 1
pnorm(q = 650, mean = mu, sd = sd, lower.tail = T) # cumulative distribution function: P(X <= 650)
## [1] 0.3085375
plot(500:900, pnorm(q = 500:900, mean = mu, sd = sd), type = "l",
 ylab = "Cumulative probability", xlab = "y value")
pnorm(q = 650, mean = mu, sd = sd, lower.tail = F) # upper tail: P(X > 650)
## [1] 0.6914625
plot(500:900, pnorm(q = 500:900, mean = mu, sd = sd, lower.tail = F), type = "l",
 ylab = "1 - cumulative probability", xlab = "y value")
qnorm(p = 0.90, mean = mu, sd = sd) # what is the 90th percentile of male peregrine masses?
## [1] 828.1552
qnorm(p = 0.10, mean = mu, sd = sd) # what is the 10th percentile of male peregrine masses?
## [1] 571.8448
qnorm(p = 0.90, mean = mu, sd = sd, lower.tail = F) # with lower.tail = F, this is the 10th percentile
## [1] 571.8448
Now, we will simulate two linear models: ANOVA and linear regression.
set.seed(1)
# Differences between means (e.g., ANOVA)
b0 = 700 # mean male peregrine mass
b1 = 300 # females weigh approximately 1 kg, so a difference of 300 g from males
male.sd <- 100
fem.sd <- 200
sex <- rep(c("male", "female"), each = 100)
eps <- c(rnorm(n = 100, mean = 0, sd = male.sd), # male sd
rnorm(n = 100, mean = 0, sd = fem.sd)) # female sd
weights <- b0 + b1*(sex == "female") + eps
bird.dat <- data.frame(sex, weights)
ggplot(bird.dat, aes(x = sex, y = weights)) + geom_boxplot()
bird.lm <- lm(weights ~ sex, data = bird.dat)
summary(bird.lm) # note: lm() assumes the same residual variance in both groups
##
## Call:
## lm(formula = weights ~ sex, data = bird.dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -375.31 -82.60 -11.63 72.73 469.16
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 992.44 14.96 66.33 <2e-16 ***
## sexmale -281.55 21.16 -13.31 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 149.6 on 198 degrees of freedom
## Multiple R-squared: 0.4721, Adjusted R-squared: 0.4694
## F-statistic: 177.1 on 1 and 198 DF, p-value: < 2.2e-16
coef(bird.lm)
## (Intercept) sexmale
## 992.4384 -281.5496
# Linear regression
nB0 <- 20 # intercept
nB1 <- 0.9 # slope
x.vals <- seq(10, 80, length.out = 500)
error <- rnorm(500, mean = 0, sd = 15)
y.vals <- nB0 + nB1 * x.vals + error
plot(x.vals, y.vals)
lin.reg <- lm(y.vals ~ x.vals)
Anova(lin.reg)
## Anova Table (Type II tests)
##
## Response: y.vals
## Sum Sq Df F value Pr(>F)
## x.vals 145198 1 586.95 < 2.2e-16 ***
## Residuals 123194 498
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
coef(lin.reg) # very close to our defined intercept and slope
## (Intercept) x.vals
## 22.0209274 0.8416266
plot(x.vals, coef(lin.reg)[1] + coef(lin.reg)[2]*x.vals, type = "l", ylim = c(0, 120))
points(x.vals, y.vals)
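A quick way to check that the simulation recovered the truth is to ask whether the 95% confidence intervals cover the parameters we set:

```r
# The intervals should (usually) contain the true intercept (20) and slope (0.9)
confint(lin.reg)
```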
Second, we will learn to sample from the binomial distribution.
n <- 100000 # Sample size
N <- 16 # Number of individuals
p <- 0.8 # Probability of surviving
sample <- rbinom(n = n, size = N, prob = p)
head(sample)
## [1] 14 12 9 13 13 11
hist(sample, col = "grey")
plot(0:16, dbinom(x = 0:16, size = N, prob = p), ylab = "probability mass function", xlab = "k events out of 16 trials")
plot(0:16, pbinom(q = 0:16, size = N, prob = p), ylab = "cumulative distribution function", xlab = "k events out of 16 trials")
pbinom(q = 11, size = N, prob = p) # probability of getting 11 or fewer successes out of 16 trials
## [1] 0.2017546
pbinom(q = 11, size = N, prob = p, lower.tail = F) # probability of getting more than 11 (i.e., 12 or more)
## [1] 0.7982454
qbinom(p = 0.9, size = N, prob = p) # 90th percentile
## [1] 15
qbinom(p = 0.1, size = N, prob = p) # 10th percentile
## [1] 11
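For a discrete distribution, the distribution function is just the running sum of the mass function, which we can verify directly:

```r
# pbinom() is the cumulative sum of dbinom():
sum(dbinom(0:11, size = 16, prob = 0.8))  # equals pbinom(11, 16, 0.8), ~0.2018
```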
Here, we will simulate binomial models.
set.seed(1)
# Difference between means
bin.B0 <- qlogis(0.8) # logit of survival in the control group (0.8)
bin.B1 <- qlogis(0.2) - qlogis(0.8) # logit-scale difference (survival in the treatment group is 0.2)
treatment <- sample(c("control","spray"), size = 600, replace = TRUE)
bin.y <- bin.B0 + bin.B1*(treatment == "spray")
p <- plogis(bin.y) # back-transformed probabilities for each of the 600 observations
summary(p)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.200 0.200 0.800 0.508 0.800 0.800
y <- rbinom(n = 600, size = 1, prob = p) # size of trial is 1
bin.means <- glm(y ~ treatment, family = binomial)
coef(bin.means)
## (Intercept) treatmentspray
## 1.504077 -2.774540
plogis(coef(bin.means)[1])
## (Intercept)
## 0.8181818
plogis(sum(coef(bin.means)))
## [1] 0.2191781
# Binomial regression
binr.B0 <- -9
binr.B1 <- 0.13
age <- runif(1000, min = 0, max = 100)
bin.probs <- plogis(binr.B0 + binr.B1*age)
br.y <- rbinom(n = 1000, size = 1, prob = bin.probs) # size of trial is 1
bin.reg <- glm(br.y ~ age, family = binomial)
coef(bin.reg)
## (Intercept) age
## -9.5998542 0.1409979
plot(0:100, plogis(coef(bin.reg)[1] + coef(bin.reg)[2]*0:100), type = "l",
ylab = "Probability of mortality", xlab = "age")
Third, we will learn how to sample from the Poisson distribution.
n <- 100000 # Sample size
lambda <- 5 # Average # individuals per sample
sample <- rpois(n = n, lambda = lambda)
head(sample)
## [1] 6 12 5 9 3 5
par(mfrow = c(2,1))
hist(sample, col = "grey", main = "Default histogram")
plot(table(sample), main = "A better graph", lwd = 3, ylab = "Frequency")
plot(0:20, dpois(x = 0:20, lambda = 5), ylab = "probability mass function", xlab = "Counts")
plot(0:20, ppois(q = 0:20, lambda = 5), ylab = "cumulative distribution function", xlab = "Counts")
ppois(10, lambda = 5, lower.tail = T) # probability of getting a value 10 or less
## [1] 0.9863047
ppois(10, lambda = 5, lower.tail = F) # probability of getting a value greater than 10 (i.e., 11 or more)
## [1] 0.01369527
qpois(0.9, lambda = 5) # the 90th percentile is 8
## [1] 8
qpois(0.1, lambda = 5) # the 10th percentile is 2
## [1] 2
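A defining property of the Poisson distribution is that its mean equals its variance (equidispersion), which is exactly why overdispersed counts (variance greater than the mean) call for negative binomial or Conway-Maxwell-Poisson models instead:

```r
# Mean and variance of Poisson draws should both be close to lambda:
set.seed(1)
draws <- rpois(1e5, lambda = 5)
mean(draws) # ~5
var(draws)  # ~5
```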
Finally, we will learn to simulate Poisson models.
set.seed(1)
# Differences between two Poisson means
beta0 <- log(7) # log of the group 1 mean (7)
beta1 <- log(7 - 3) # log ratio of means: the group 2 mean is exp(beta0 + beta1) = 7 * 4 = 28
site <- sample(c("1","2"), size = 500, replace = TRUE)
y_sim <- rpois(n = 500, lambda = exp(beta0 + beta1 * (site == "2")))
pois.mod <- glm(y_sim ~ site, family = poisson)
Anova(pois.mod)
## Analysis of Deviance Table (Type II tests)
##
## Response: y_sim
## LR Chisq Df Pr(>Chisq)
## site 3466.6 1 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
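The Poisson GLM uses a log link, so effects multiply on the count scale. With the betas defined above, the implied group means are:

```r
# Back-transforming the simulated coefficients to the count scale:
exp(log(7))          # group 1 mean: 7
exp(log(7) + log(4)) # group 2 mean: 7 * 4 = 28 (log-link effects are multiplicative)
# exp(coef(pois.mod)) should recover values close to these
```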
# Poisson regression
beta0 <- 1
beta1 <- 0.2
x <- seq(0, 1.5, length.out = 500) # range of covariate values
mu <- exp(beta0 + beta1 * x) # line of best fit
y <- rpois(n=500, lambda=mu)
plot(x, mu, type = "l")
points(x, y)
pois.reg <- glm(y ~ x, family = poisson)
Anova(pois.reg)
## Analysis of Deviance Table (Type II tests)
##
## Response: y
## LR Chisq Df Pr(>Chisq)
## x 14.341 1 0.0001525 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(pois.reg)
##
## Call:
## glm(formula = y ~ x, family = poisson)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7435 -0.8269 -0.0903 0.6515 2.9346
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.99886 0.05199 19.212 < 2e-16 ***
## x 0.21809 0.05767 3.782 0.000156 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 596.11 on 499 degrees of freedom
## Residual deviance: 581.76 on 498 degrees of freedom
## AIC: 1999.8
##
## Number of Fisher Scoring iterations: 5
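The summary above notes that the dispersion parameter is fixed at 1 for the Poisson family. In line with the over-dispersion discussion in this session, a quick way to check that assumption (my own addition) is the ratio of residual deviance to residual degrees of freedom, both of which summary() already reports:

```r
# Rough over-dispersion check: residual deviance / residual df should be near 1
# for data that are truly Poisson; values well above 1 suggest over-dispersion
# (consider a negative binomial or Conway-Maxwell-Poisson model instead)
set.seed(1)
x <- seq(0, 1.5, length.out = 500)
y <- rpois(n = 500, lambda = exp(1 + 0.2 * x)) # truly Poisson data
pois.reg <- glm(y ~ x, family = poisson)
deviance(pois.reg) / df.residual(pois.reg) # near 1 here, as expected
```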
Each of the following distributions has its own random-generation function:
rbernoulli() # Bernoulli (in package purrr)
rnbinom() # negative binomial
runif() # uniform
rgamma() # Gamma
mpcmp::rcomp() # Conway-Maxwell-Poisson (in package mpcmp)
rbeta() # Beta
rexp() # exponential
This is just the surface of the many frequentist models you could use to analyze your data. For example, we did not delve into the world of multivariate statistics for understanding which set of response variables explains the most variation in your data, such as Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), or Partial Least Squares Regression (PLSR).
Two good resources for multivariate statistics are “A Primer of Ecological Statistics” by Gotelli and Ellison (Ch. 12) and the first chapter of “A Beginner’s Guide to Generalized Additive Models with R” by Zuur, which is also a good introduction to GAMs, in which the linear predictor depends on smooth functions (rather than parametric functions) of the covariates.
If you want to explore simulating multivariate data, Dr. Eric Scott developed an R package called holodeck for simulating multivariate data.
Here is a list of additional resources on the statistics behind these models, with more example R code:
Bolker, B. (2008) Ecological models and data in R. Princeton University Press, Princeton, New Jersey, USA. (https://ms.mcmaster.ca/~bolker/emdbook/)
Burnham, K.P. and Anderson, D.R. (2002) Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach. Second edition. Springer, New York, New York, USA.
Gotelli, N.J. and Ellison, A.M. (2012) A Primer of Ecological Statistics. Second edition. Sinauer Associates, Sunderland, Massachusetts, USA.
Kéry, M. (2010) Introduction to WinBUGS for Ecologists. Academic Press, Burlington, Massachusetts, USA.
White, G.C. and Cooch, E. (2005) Program MARK: A Gentle Introduction. Fourth edition. (http://www.phidot.org/software/mark/docs/book/)
Zuur, A.F, E.N. Ieno, and E. Meesters. (2009) A Beginner’s Guide to R. Springer, New York, New York, USA.
Zuur, A.F. (2012) A Beginner’s guide to Generalized Additive Models with R. Highland Statistics Ltd, Newburgh, NY.